Blog #8: MORE DATA :D
April 27, 2025
Heya!
We’re now on a countdown to the last blog post.
This week, I revisited prompt engineering, specifically chain-of-thought. That’s just a fancy way of saying I added “Let’s think step by step” to the prompt, in hopes of encouraging intermediate steps that lead to better accuracy.
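For anyone curious what that looks like in practice, here's a minimal sketch of appending a chain-of-thought cue to each test case before querying the model. The prompt wording and the `build_prompt` helper are just illustrations, not my actual project code:

```python
# Sketch: appending a chain-of-thought cue to a diagnosis prompt.
# The template text below is illustrative, not the project's real prompt.

COT_SUFFIX = "Let's think step by step."

def build_prompt(case_description: str) -> str:
    """Combine a clinical case with the chain-of-thought cue."""
    return (
        "Given the following clinical case, name the most likely rare disease.\n\n"
        f"Case: {case_description}\n\n"
        f"{COT_SUFFIX}"
    )

print(build_prompt("A 4-year-old presents with recurrent fractures and blue sclerae."))
```

The idea is that ending the prompt with the cue nudges the model to write out intermediate reasoning before committing to a diagnosis.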
After running my fine-tuned model with this improved prompt, I have yet another spreadsheet of model outputs. From a quick inspection of the first ten responses, they don’t seem significantly different, though I won’t know for sure until I grade them.
Throwing back to Blog #7: remember how the fine-tuned model performed better on the better-represented diseases? While that’s promising and suggests the model learned from the data, accuracy for the underrepresented diseases didn’t improve as much.
To combat this imbalance, I went looking for more data. I couldn’t find many real clinical cases online, but while playing around on the Orphanet website, I discovered Orphadata’s “Phenotypes Associated with Rare Diseases” XML file. One thing I found interesting: instead of just listing the phenotypes for each rare disease, it also states how frequently each phenotype is expressed. I cleaned and reformatted the data a bit, added it to my training data, and ran fine-tuning. With the test data, I again asked the model to generate responses.
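As a rough idea of what the cleaning step involves, here's a sketch of turning one disorder entry from that XML into a prompt/completion training example. The tag names (`Disorder`, `HPODisorderAssociation`, `HPOTerm`, `HPOFrequency`) follow the Orphadata layout as I remember it, so check them against your own copy of the file:

```python
# Sketch: converting an Orphadata disorder entry into a training example.
# Tag names are my recollection of the Orphadata XML schema; verify
# against the actual "Phenotypes Associated with Rare Diseases" file.
import xml.etree.ElementTree as ET

def disorder_to_example(disorder: ET.Element) -> dict:
    """Build a prompt/completion pair from one <Disorder> element."""
    name = disorder.findtext("Name")
    lines = []
    for assoc in disorder.iter("HPODisorderAssociation"):
        term = assoc.findtext("HPO/HPOTerm")
        freq = assoc.findtext("HPOFrequency/Name")
        if term:
            # Keep the phenotype frequency, since that extra signal
            # is what made this dataset interesting in the first place.
            lines.append(f"- {term} ({freq or 'frequency unknown'})")
    prompt = (
        "A patient shows these phenotypes:\n"
        + "\n".join(lines)
        + "\nWhat rare disease is most likely?"
    )
    return {"prompt": prompt, "completion": name}
```

In the real pipeline this would loop over every `<Disorder>` in the file and write the pairs out in whatever format the fine-tuning API expects.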
I’ll probably grade the two sheets of model responses this weekend.
Trying to catch these rarer cases got me thinking about retrieval-augmented generation (RAG), so I dove into another research rabbit hole. Essentially, the model retrieves a relevant document from a database and refers to it while answering, usually increasing accuracy and lowering hallucination. Though the aim of my project is just fine-tuning, I’m kinda curious how using RAG on top of the fine-tuned model would turn out…
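To make the RAG idea concrete, here's a toy sketch: retrieve the most relevant disease description by naive word overlap, then stuff it into the prompt as context. A real setup would use embeddings and a vector store instead of word overlap, and the document texts here are made up:

```python
# Toy RAG sketch: word-overlap retrieval plus a context-stuffed prompt.
# Real systems use embedding similarity and a vector database; this is
# only meant to show the shape of the retrieve-then-prompt loop.

def retrieve(query: str, documents: list[str], k: int = 1) -> list[str]:
    """Return the k documents sharing the most words with the query."""
    q_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def rag_prompt(query: str, documents: list[str]) -> str:
    """Prepend the retrieved document(s) as context for the model."""
    context = "\n".join(retrieve(query, documents))
    return (
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer using the context above."
    )
```

The appeal for underrepresented diseases is that the model no longer has to memorize every rare condition during fine-tuning; it just has to recognize the right one when its description is handed over in the prompt.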
Welp. Art is never finished, only abandoned. Starting next week, I’ll begin wrapping it up, building a website, and working on the poster.
Thanks for reading!
– Luoxi
