Week 7 (4/6-4/10) - Switching Models, Bayesian Results, and Finally Prepping the LLM Inputs
April 10, 2026
Welcome back to my blog! This week marked a major turning point in the project: wrapping up the baseline models, resolving the last of the environment issues, and finally getting all of the input data ready to train the LLM.
To start, I finished verifying the GPT-OSS installation, which came with its own set of debugging hurdles. While troubleshooting last week, I had accidentally created a virtual environment inside another one, which caused a confusing chain of issues. The main problem turned out to be that my notebook was still running on the old kernel, so even after remaking the environments and fixing the version conflicts, nothing actually updated until I restarted it. On top of that, I had to downgrade NumPy to a compatible version. After all of that, though, I ran into a more fundamental problem: running a model of that scale locally on my Mac would be painfully slow. So I pivoted to the HuggingFace interface instead, which lets me access larger models despite my computer's limited computational power.
One thing I realized was that with the free version of HuggingFace, larger models are at a higher risk of timing out. This was a major reason I switched from GPT-OSS to BioMistral as my initial LLM: a significantly smaller model that was specifically pre-trained on biomedical literature. Given the domain of this project, it feels like a much more natural fit than a general-purpose model, and its size makes it far more practical to work with under my computational constraints.
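For anyone curious what the HuggingFace route looks like in practice, here is a minimal sketch of querying a hosted model through the huggingface_hub client. The repo id, token variable, and prompt are placeholder assumptions for illustration, not my exact setup.

```python
# A minimal sketch of querying a hosted model through the HuggingFace Inference API,
# assuming the "BioMistral/BioMistral-7B" repo id and an HF_TOKEN environment variable.
import os
from huggingface_hub import InferenceClient

client = InferenceClient(
    model="BioMistral/BioMistral-7B",   # assumed repo id for BioMistral
    token=os.environ.get("HF_TOKEN"),   # free-tier token; larger models may time out
)

prompt = "A 67-year-old patient presents with chest pain and shortness of breath. Likely diagnoses:"
response = client.text_generation(prompt, max_new_tokens=128, temperature=0.2)
print(response)
```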
I also finally have the Bayesian logistic regression results to share. After running cross-validation, the overall macro F1 score came out to 0.1936 ± 0.0454. As I suspected, this is roughly the same as the standard logistic regression and Random Forest baselines. It confirms what the past two weeks have been pointing to: classical machine learning models, regardless of how they are tuned or structured, simply do not have the capacity to capture the complexity of this data. While disappointing, that is also exactly the motivation for the rest of this project: an LLM-based approach is necessary if I want to build a potentially useful clinical decision support model.
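For context, the fold-wise mean ± std of macro F1 comes from a standard cross-validation loop. Below is a minimal sketch of that pattern, using scikit-learn's plain logistic regression and synthetic data as stand-ins for my Bayesian model and the real features.

```python
# Sketch of cross-validated macro F1 (mean ± std); the classifier and data here
# are stand-ins, not the actual Bayesian model or MIMIC-IV features.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real feature matrix and diagnosis labels
X, y = make_classification(n_samples=500, n_features=20, n_classes=3,
                           n_informative=5, random_state=0)

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=5, scoring="f1_macro")
print(f"macro F1: {scores.mean():.4f} ± {scores.std():.4f}")
```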
With the baseline chapter now closed, I shifted my focus to preparing the inputs for BioMistral. The first major task was figuring out how to parse and format CREST for fine-tuning, which I plan to do in Google Colab (since fine-tuning can’t easily be done locally on my Mac). The CREST files are stored as XML with HTML-formatted text, so I parsed each file to extract its key components: the title of the paper, the developer (the organization that published it), the recommendation grade, and the recommendation text itself. These were saved into a unified dataframe.
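The parsing loop looks roughly like the sketch below. The element names here are placeholders rather than the exact CREST schema, and the HTML inside the recommendation text gets flattened to plain text.

```python
# Rough sketch of the CREST parsing loop; tag names (title, developer, grade,
# recommendation) are illustrative placeholders, not CREST's real schema.
import glob
import xml.etree.ElementTree as ET
import pandas as pd

rows = []
for path in glob.glob("crest/*.xml"):
    root = ET.parse(path).getroot()
    for rec in root.iter("recommendation"):            # placeholder element name
        rows.append({
            "title": root.findtext("title"),
            "developer": root.findtext("developer"),
            "grade": rec.get("grade"),
            "text": "".join(rec.itertext()).strip(),   # flattens embedded HTML tags
        })

crest_df = pd.DataFrame(rows)
```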
When inspecting this dataframe, I realized that the recommendation grades were not in a unified system. CREST pulls from many different sources, each using its own classification scheme, which makes it impossible to compare recommendations across guidelines without standardization. Fortunately, the CREST folder includes a schemes.xml file that maps each scheme ID to a general strength category. I went back into my parsing loop to extract the scheme ID alongside each recommendation and wrote a function to apply this mapping, normalizing everything into a single system: strong, moderate, weak, consensus (generally accepted as true among clinicians), and inconclusive. I dropped the inconclusive recommendations, since they would not be useful or reliable inputs for clinical decision support.
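The normalization step itself is simple once the mapping exists. Here is a rough sketch, reusing the hypothetical dataframe from the parsing sketch above (with a scheme_id column added by the updated loop) and made-up tag names for schemes.xml.

```python
# Sketch of the grade-normalization step. The scheme/strength tag names are
# illustrative; schemes.xml supplies the real scheme-to-strength mapping.
import xml.etree.ElementTree as ET

def load_scheme_map(path="crest/schemes.xml"):
    """Map each grading-scheme ID to a unified strength category."""
    root = ET.parse(path).getroot()
    return {s.get("id"): s.get("strength") for s in root.iter("scheme")}  # placeholder tags

scheme_map = load_scheme_map()
crest_df["strength"] = crest_df["scheme_id"].map(scheme_map)

# Drop inconclusive recommendations; keep only usable strength categories
keep = {"strong", "moderate", "weak", "consensus"}
crest_df = crest_df[crest_df["strength"].isin(keep)]
```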
The other big realization I had this week was clarifying how MIMIC-IV and CREST actually fit together in the pipeline, since I had been thinking about them somewhat interchangeably. They actually serve two distinct stages. MIMIC-IV will help BioMistral learn to go from patient symptoms to a diagnosis, while CREST (paired with RAG) will support the second stage: going from a diagnosis to treatment recommendations, complete with the accompanying citation and the strength of each recommendation. I also started exploring whether CREST could support the first stage too. Since many clinical guidelines mention patient presentations, contraindications, and symptoms within the recommendation text, I could potentially use BioMistral to extract those symptoms from the CREST text and incorporate them into the first stage as well. This is something I plan to dive into first thing next week. I’d love to hear your thoughts or suggestions on using an LLM to extract structured information (like symptoms or contraindications) from unstructured clinical text.
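As an exploratory sketch of what that extraction might look like, here is the kind of prompt I have in mind, reusing the hypothetical client and dataframe from the sketches above; the prompt wording is a first guess, not a finished design.

```python
# Exploratory sketch: prompt BioMistral to pull structured symptom mentions out of
# a recommendation's free text. Reuses the hypothetical `client` and `crest_df`
# from the earlier sketches.
def build_extraction_prompt(recommendation_text: str) -> str:
    return (
        "Extract every patient symptom, presentation, or contraindication mentioned "
        "in the following clinical recommendation. Return them as a comma-separated "
        "list, or 'none' if there are none.\n\n"
        f"Recommendation: {recommendation_text}"
    )

symptoms_raw = client.text_generation(
    build_extraction_prompt(crest_df.iloc[0]["text"]), max_new_tokens=64
)
```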
For MIMIC-IV specifically, I worked on structuring the data as text prompts for the model. The plan is to serialize the numerical features as key-value properties within each prompt, while the textual values (like medications or chief complaints) will be woven directly into the prompt’s natural language. Now that both dataframes (from CREST and MIMIC-IV) are formatted and ready, all of the input data is finally in place.
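To illustrate what I mean, here is a sketch of turning a single patient row into a prompt; the column names are placeholders rather than my final feature set.

```python
# Sketch of converting one MIMIC-IV row (a pandas Series or dict) into a text
# prompt. Column names (age, heart_rate, chief_complaint, medications, ...) are
# placeholders for the actual engineered features.
def row_to_prompt(row) -> str:
    numeric_part = ", ".join(
        f"{name}: {row[name]}"
        for name in ["age", "heart_rate", "resp_rate", "o2_sat"]
        if name in row
    )
    return (
        f"Patient presentation. Chief complaint: {row.get('chief_complaint', 'unknown')}. "
        f"Current medications: {row.get('medications', 'none listed')}. "
        f"Vitals and measurements: {numeric_part}. "
        "What is the most likely diagnosis?"
    )
```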
Next week, I will turn both dataframes into text prompts, make a final decision on whether to extract symptoms from CREST for the first stage, and train BioMistral on both pipeline stages. I plan to integrate SHAP into the first stage and RAG into the second. If you’ve used SHAP (or any other interpretability tool) on an LLM before, I’d love to hear about your experience, which tools you used, and whether there is anything I should be prepared for. Stay tuned to see my progress!
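I haven’t tried SHAP on a language model yet, but based on the SHAP documentation, the basic pattern for explaining a transformers text pipeline looks something like the sketch below (shown with a small off-the-shelf classifier rather than BioMistral).

```python
# Not something I've run yet: the documented SHAP pattern for a transformers text
# pipeline, using a small public sentiment classifier as a stand-in model.
import shap
from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    top_k=None,  # return scores for all labels
)
explainer = shap.Explainer(clf)
shap_values = explainer(["The patient presents with severe chest pain."])
shap.plots.text(shap_values)  # token-level contribution plot
```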
Comments
Hi Aanya! Great work this week! I found it very inspiring that you were able to concretely define what roles MIMIC-IV and CREST would play in your project this week alongside more progress with your LLM overall. It’s great to see progression despite previous obstacles that you shared here on your blog. Looking forward to next week!
Hi Elin, thank you so much!
Hi Aanya, over the past few weeks, I’ve had issues with Colab in training some of my models for a large number of epochs, often leading to runtimes being disconnected and progress lost. Do you think this could happen in your case, and if so, have you made any plans to manage it?
Hi Anav, great question! Yes, I am a bit concerned about this happening. If I run into these issues, I actually plan to consider the approaches you took when you faced them (including possibly buying Colab Pro) along with doing some of my own research to see what possible fixes there are. I will definitely ask in Week 9’s blog for advice if I run into any such issues!
Great progress this week! The pivot you made midway through seems like it’ll really pay off in the long run, especially since the new approach is such a natural fit for your project. And figuring out how the two datasets serve different purposes feels like a breakthrough moment that will hopefully make everything much smoother. Excited to see how next week goes!