Week 12 (5/11 - 5/15) - Final Takeaways: Reflecting on the Project and Making Final Fixes!
May 15, 2026
Welcome back to my blog! This is the last post before the final presentation, and I’m excited to walk you through my last week.
The main technical goal this week was fixing a critical flaw I had identified in the pipeline after receiving feedback on the demo outputs: the final recommendations were badly off-target for all five patients.
After investigating further, I was able to pinpoint exactly where things were breaking down. In the original pipeline, Stage 1 was generating retrieval queries by just restating the predicted ICD titles, producing things like “guidelines for weakness, dehydration.” These vague queries caused FAISS to default to whatever was most represented in the CREST index, completely ignoring the actual patient condition.
The fix was to replace those vague queries with proper clinical ones. I used the Groq API (specifically Llama) to generate MeSH-based boolean queries for each MIMIC patient, drawing on the full clinical note, the predicted ICD diagnoses, and the administered medications. The result was queries like (“Central Venous Catheter Infections” OR “Catheter-Related Infections”) AND (“Antibiotic Therapy”). These became the retrieval query labels in the Stage 1 training data, so that when BioMistral is fine-tuned on them, it learns to output specific clinical terminology instead of vague restatements.
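To make the query format concrete, here is a minimal sketch of how a boolean query like the one above can be assembled once the terms are grouped by concept. This is not the code from my pipeline (there, Llama writes the query text directly); the function name `build_boolean_query` and the grouped-list input are assumptions made purely for illustration.

```python
from typing import Sequence


def build_boolean_query(concept_groups: Sequence[Sequence[str]]) -> str:
    """Join synonyms with OR inside each group, then join the groups with AND.

    Hypothetical helper for illustration only; not part of the actual pipeline.
    """
    clauses = []
    for group in concept_groups:
        # Quote each non-empty term so multi-word headings stay together.
        terms = [f'"{term.strip()}"' for term in group if term.strip()]
        if terms:
            clauses.append("(" + " OR ".join(terms) + ")")
    return " AND ".join(clauses)


# The example terms are the ones quoted in the post.
query = build_boolean_query([
    ["Central Venous Catheter Infections", "Catheter-Related Infections"],
    ["Antibiotic Therapy"],
])
print(query)
# ("Central Venous Catheter Infections" OR "Catheter-Related Infections") AND ("Antibiotic Therapy")
```

Grouping synonyms with OR and separate concepts with AND is what keeps the query specific: it names the condition and the treatment together instead of only restating a symptom.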
The before and after results were pretty striking. After retraining Stage 1 on the improved labels, 4 out of 5 patients received at least partially relevant recommendations. Patient 5 (laryngitis) received a recommendation that included azithromycin, which exactly matches the ground truth medication. Patient 3 (dyspnea) received a COPD guideline that explicitly recommended investigating BNP and troponin to rule out congestive heart failure, which is the actual diagnosis.
That said, the results are still far from perfect, and I think it’s important to be honest about that. Patient 2’s foot infection still retrieved pressure ulcer guidelines despite the improved query, which points to a separate issue: the CREST index itself is imbalanced, with certain guideline categories heavily overrepresented. Patient 4 was also tricky. The recommendations weren’t completely irrelevant, as they picked up on something real in the patient’s medication history. However, they weren’t safe or actionable for what the patient was actually presenting with, which is what matters most in a clinical setting.
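One simple way to check for this kind of imbalance is to count how many indexed chunks fall into each guideline category. The sketch below assumes each chunk carries a category label, which may not match how the CREST data is actually organized, and the labels and counts are made up for illustration rather than measured from my index.

```python
from collections import Counter
from typing import Dict, Iterable


def category_shares(categories: Iterable[str]) -> Dict[str, float]:
    """Return each category's fraction of the index, largest first.

    Hypothetical helper; assumes one category label per indexed chunk.
    """
    counts = Counter(categories)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.most_common()}


# Illustrative labels and counts only, NOT real CREST statistics.
chunk_categories = ["pressure ulcer"] * 6 + ["COPD"] * 3 + ["foot infection"] * 1

for category, share in category_shares(chunk_categories).items():
    print(f"{category}: {share:.0%}")
# pressure ulcer: 60%
# COPD: 30%
# foot infection: 10%
```

If one category dominates like this, nearest-neighbor search has far more candidates from that category to match against, which would be consistent with what happened to Patient 2.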
Beyond the technical fixes, I also spent a significant chunk of this week rehearsing and refining my slides. Stepping back and looking at the project as a whole, I think the biggest thing I’m taking away is how much you learn from building something that doesn’t fully work. The baseline models failing completely turned out to be the clearest possible evidence that classical ML wasn’t the right tool for this problem, and that made the strongest argument for an LLM approach. Finding out that the retrieval query was the bottleneck, and being able to diagnose and fix it, was also really satisfying. I came into this project thinking the hardest part would be the model itself. It turned out the hardest parts were the data pipeline (I spent weeks researching how to find and process clinical practice guidelines for my project) and the retrieval design, given the limits on the project’s scope in terms of both computational power and available data.
Stay tuned – I will be posting the recording of my final presentation and my full GitHub on Sunday!
Comments

Hi Aanya, it’s great to see that you were committed to improving your project as much as possible, and were able to balance that and working on the final presentation. Nice work!