Week 1 (2/23-2/27) - Welcome to my blog!
February 27, 2026
Welcome everyone! I’m Aanya, and this is my blog bringing you along my journey of creating an LLM, or large language model, for clinical decision support (specifically in immunology). First, let me give you some background on why I chose this project in the first place!
The potential for a model like this is enormous: it could meaningfully relieve pressure on our healthcare system. Currently, roughly 47% of healthcare workers report burnout, and the US alone faces a projected shortage of up to 124,000 physicians by 2034. A tool like this model could ease that burden. Especially in high-stress situations, such an LLM could improve diagnostic precision and reduce error rates by helping physicians narrow down potential causes based on patient symptoms, shortening hospital stays and accelerating recovery.
What am I even attempting to create? An LLM, or large language model, is a type of natural language processing (NLP) model built from neural network architectures. Like any machine learning model, it needs to be trained, and training an LLM involves two critical phases: pre-training and fine-tuning. To summarize simply, pre-training is where the model learns by processing extensive amounts of unlabeled text, using various techniques to gain a deep understanding of the data. By the end, it understands not only the meaning of, but also the context connecting, the concepts that make up its training data (often individual words and phrases).
To build a well-functioning LLM, massive amounts of general textual data and computational power are needed. Thus, I plan to apply an open-source LLM, GPT-OSS, that has already been pre-trained. To specialize it for our context, I will take advantage of the second phase: fine-tuning. Here, the model’s parameters are optimized using task-specific, labeled datasets of much more limited scale (like the one I will use in this project), enhancing performance on these tasks.
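To make these two phases concrete, here's a toy illustration in Python. It's just a tiny linear model in NumPy, not an LLM, but the pattern is the same one I'll apply to GPT-OSS: train on broad, plentiful data first, then continue training briefly on a small task-specific set.

```python
import numpy as np

def train(w, X, y, lr=0.1, steps=200):
    """Plain gradient descent on mean-squared error for a 1-D linear model."""
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

rng = np.random.default_rng(0)

# "Pre-training": lots of general data drawn from a broad trend (y ≈ 3x).
X_general = rng.normal(size=(1000, 1))
y_general = 3.0 * X_general[:, 0] + rng.normal(scale=0.1, size=1000)
w = train(np.zeros(1), X_general, y_general)

# "Fine-tuning": a small task-specific set with a slightly shifted trend (y ≈ 3.5x).
X_task = rng.normal(size=(30, 1))
y_task = 3.5 * X_task[:, 0]
w_finetuned = train(w.copy(), X_task, y_task, steps=50)

# Fine-tuning nudges the pre-trained weights toward the specialized task.
task_loss_before = np.mean((X_task @ w - y_task) ** 2)
task_loss_after = np.mean((X_task @ w_finetuned - y_task) ** 2)
```

The takeaway: the fine-tuned model fits the specialized task far better than the pre-trained one, while starting from (and reusing) everything learned in pre-training, which is exactly why we don't need to train an LLM from scratch.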
Even more specifically, our LLM will have a transformer architecture. Its defining feature is the self-attention mechanism: in the original design, an encoder processes the input and encodes it into embeddings, which a decoder then uses to model the relationships in the data and generate output (many modern LLMs, including GPT-style models, keep only the decoder stack). By weighing the importance of different parts of the input, transformer models can process large amounts of highly interconnected information, which is invaluable given the complexity of medical data. You've probably heard of at least one famous transformer in computational biology: AlphaFold2, which predicts 3D protein structures from amino acid sequences by drawing on known protein structures and sequences. Similarly, models in genomics and single-cell biology, like scGPT and Geneformer, are used for cell-type annotation, multi-omic integration, and investigating the behavior of gene regulatory networks.
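For the curious, the self-attention computation itself can be sketched in a few lines of NumPy. This is the standard scaled dot-product attention formula, softmax(QKᵀ/√d)·V, with random matrices standing in for learned weights:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention: softmax(QK^T / sqrt(d)) V."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d))  # how much each token attends to every other token
    return weights @ V, weights

rng = np.random.default_rng(42)
n_tokens, d_model = 4, 8
X = rng.normal(size=(n_tokens, d_model))  # embeddings for 4 input tokens
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out, weights = self_attention(X, Wq, Wk, Wv)
# Each row of `weights` sums to 1: a probability distribution over the input tokens.
```

Each output token is a weighted mix of all input tokens, which is precisely how the model captures relationships between distant parts of, say, a patient's record.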
Now, some of you might be wondering: if LLMs are so prominent these days, why hasn't such a model already been created? Research on large language models for clinical decision support is still quite limited (though promising!). BioGPT and Med-PaLM, developed by Microsoft and Google respectively, have been trained extensively and achieve very high performance on medical exams (roughly 80%). However, they have mostly been evaluated on question-answering tasks and medical exams rather than applied to real-world clinical data.
So, coming back to my main point, my goal with this project is to build an interpretable large language model that identifies potential diagnoses and suggests next steps for treatment based on all the available patient data. Interpretability is a key requirement and highlight of this project, especially important in such a sensitive field: we should be able to clearly understand the model's reasoning (how it reaches its decisions) and avoid the black-box issue common in machine learning. Because these can be life-and-death situations, the aim will remain decision support; the model is meant to assist physicians, not replace them when people's lives hang in the balance.
With this plan, I started on my project. My first week involved reading through previous literature on building and processing a database of clinical practice guidelines for fine-tuning my model. I delved into a paper I had found promising, "Developing search strategies for clinical practice guidelines in SUMSearch and Google Scholar" [1].
My goal was to understand how to effectively search for the medical documents (specifically, clinical practice guidelines, or CPGs) that would form the database for fine-tuning my model. One key takeaway was the MeSH system. MeSH, or Medical Subject Headings, is the National Library of Medicine (NLM)'s controlled vocabulary for searching the biomedical and health-related journals in MEDLINE and PubMed, allowing for much more precise, structured searches. The paper compared combinations of MeSH terms and common terms for CPGs across two search engines, SUMSearch and Google Scholar, and found the common term "guideline" to be the best option. However, the paper's age made it hard for me to trust the results; in fact, SUMSearch is no longer accessible. So I looked into similar, more recent papers to see what progress had been made and whether more popular search engines or databases had appeared since [2, 3, 4, 5].
However, this topic isn’t really explored, and it was difficult to find papers that addressed the same question. After reviewing a few papers, I found that phrasing searches in variations of the PICO format, a well-known standard for medical searches, are the most common. The only new search database I found mentioned in this literature (besides the original PubMED) was MacPLUS. PICO stands for population, intervention, control, and outcome, and it proves to be a useful way to effectively find CPGs as it provides specific, clear questioning. Here’s an example with each part defined:
In adults with type 2 diabetes (P, population), is intermittent fasting (I, intervention/treatment) more effective than a standard diet (C, control) in improving glycemic control (O, outcome)?
MeSH terms continued to be a prominent theme in my literature review, combined with the search format using Boolean operators (AND, OR, NOT). I knew then that my next step was to go back to my own data and understand the patient records well enough to pull or translate clear MeSH terms that I could then find CPGs for.
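As a small sketch of how these pieces fit together, here's a hypothetical helper that assembles a PubMed-style query from MeSH terms and Boolean operators. The [MeSH] and [Publication Type] field tags are real PubMed syntax; the function itself and the example terms (taken from the PICO example above) are just mine for illustration:

```python
def build_query(mesh_terms, free_terms=(), pub_type="Guideline"):
    """Combine MeSH headings and free-text terms with Boolean operators
    into a PubMed-style search string."""
    mesh = " OR ".join(f'"{t}"[MeSH]' for t in mesh_terms)
    parts = [f"({mesh})"]
    if free_terms:
        parts.append("(" + " OR ".join(free_terms) + ")")
    parts.append(f'"{pub_type}"[Publication Type]')
    return " AND ".join(parts)

query = build_query(
    ["Diabetes Mellitus, Type 2", "Fasting"],
    free_terms=["intermittent fasting"],
)
# → '("Diabetes Mellitus, Type 2"[MeSH] OR "Fasting"[MeSH]) AND (intermittent fasting) AND "Guideline"[Publication Type]'
```

Note the paper's finding baked in at the end: restricting by the publication type "Guideline" (or just adding the free-text term "guideline") is what narrows results to CPGs.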
So, I introduce you to our dataset: the MIMIC-IV Demo Dataset. It is a set of de-identified electronic health records, publicly available for analysis, recording the visits of patients admitted to the emergency department or ICU for immunological problems at Beth Israel Deaconess Medical Center. Across five different files/tables, it records demographic data, admission details, medication records, lab measurements, charting, procedures, and physician notes.
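To show what working with tables like these looks like, here's a pandas sketch using toy rows standing in for the real records. The column names (hadm_id, icd_code, and so on) follow the MIMIC-IV layout, but the specific rows are invented for illustration:

```python
import io
import pandas as pd

# Toy stand-ins for two MIMIC-IV-style tables (the real data is de-identified EHR records).
admissions_csv = """hadm_id,admittime,admission_type
100,2180-05-06 14:00:00,EMERGENCY
101,2180-06-10 09:30:00,ELECTIVE
"""
diagnoses_csv = """hadm_id,icd_code,icd_title
100,D80.1,Nonfamilial hypogammaglobulinemia
100,J18.9,"Pneumonia, unspecified organism"
101,M32.9,"Systemic lupus erythematosus, unspecified"
"""

admissions = pd.read_csv(io.StringIO(admissions_csv), parse_dates=["admittime"])
diagnoses = pd.read_csv(io.StringIO(diagnoses_csv))

# Join diagnoses onto admissions so each diagnosis row carries its visit context.
merged = diagnoses.merge(admissions, on="hadm_id", how="left")
```

Joining the tables on a shared admission/stay identifier like this is how the separate files become one coherent patient record for the model.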
Investigating the data introduced my first set of obstacles. Each recorded stay has an ICD title. ICD, or the International Classification of Diseases, is another standard system, this one maintained by the WHO, for classifying diagnoses, symptoms, and procedures. This leads to my first question: how can I easily convert the various ICD titles (there are 288 unique ones in the data!) to MeSH terms?
Thus, a main goal for next week is to look for existing mappings or crosswalks that have already done this, since doing it manually would require a lot of biological research that could derail my project's timeline. Another problem that arose was the existence of multiple ICDs for the same stay. When I found rows sharing a stay ID, I investigated further and saw that they differed only in their ICDs, meaning a patient was assigned multiple ICDs during a single visit. This is my second obstacle of the week, and an aim for next week: figuring out how to process the data for the LLM so that the interconnected nature of the multiple ICDs, which are based on the same symptoms, is preserved.
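One simple approach I'm considering for the multiple-ICDs-per-stay problem, sketched in pandas with toy rows (the column names are assumptions based on how the data is organized): collapse the diagnosis table so each stay becomes a single record carrying its full list of co-occurring ICD titles.

```python
import pandas as pd

# Toy diagnosis rows: the same stay_id can appear once per assigned ICD title.
diagnoses = pd.DataFrame({
    "stay_id": [1, 1, 2, 2, 2],
    "icd_title": [
        "Nonfamilial hypogammaglobulinemia",
        "Pneumonia, unspecified organism",
        "Systemic lupus erythematosus",
        "Acute kidney failure",
        "Anemia, unspecified",
    ],
})

# Collapse to one row per stay, keeping all co-occurring diagnoses together
# so the model can see them as a related set rather than isolated labels.
per_stay = (
    diagnoses.groupby("stay_id")["icd_title"]
    .agg(list)
    .reset_index(name="icd_titles")
)
```

This keeps each visit as one training example, so the relationships among its diagnoses survive into fine-tuning instead of being split across duplicate rows.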
Let me know your thoughts on how to tackle these issues! Next week, I will address these two issues and complete my search for the CPGs, finalizing how to find them and how they will be processed into the LLM.
[1] Haase, A., Follmann, M., Skipka, G., & Kirchner, H. (2007). Developing search strategies for clinical practice guidelines in SUMSearch and Google Scholar and assessing their retrieval performance. BMC medical research methodology, 7, 28. https://doi.org/10.1186/1471-2288-7-28
[2] Atkinson, L. Z., & Cipriani, A. (2018). How to Carry out a Literature Search for a Systematic Review: a Practical Guide. BJPsych Advances, 24(2), 74–82. https://doi.org/10.1192/bja.2017.3
[3] Martínez García, L., Sanabria, A., Araya, I., Lawson, J., Solà, I., Vernooij, R. W. M., López, D., García Álvarez, E., Trujillo-Martín, M. M., Etxeandia-Ikobaltzeta, I., Kotzeva, A., Rigau, D., Louro-González, A., Barajas-Nava, L., del Campo, Estrada, M., Gracia, J., Salcedo-Fernandez, F., Haynes, R., & Alonso-Coello, P. (2015). Efficiency of pragmatic search strategies to update clinical guidelines recommendations. BMC Medical Research Methodology, 15(1). https://doi.org/10.1186/s12874-015-0058-2
[4] Mancin, S., Sguanci, M., Andreoli, D., Soekeland, F., Anastasi, G., Piredda, M., & Grazia, M. (2023). Systematic Review of Clinical Practice Guidelines and Systematic Reviews: A method for conducting comprehensive analysis. MethodsX, 12(1), 102532–102532. https://doi.org/10.1016/j.mex.2023.102532
[5] Seguin, A., Haynes, R. B., Carballo, S., Iorio, A., Perrier, A., & Agoritsas, T. (2020). Translating Clinical Questions by Physicians Into Searchable Queries: Analytical Survey Study. JMIR Medical Education, 6(1), e16777. https://doi.org/10.2196/16777
Comments
Really interesting start to your project! The technical stack you’ll be using looks impressive, and the project itself seems like it could be very useful in addressing real-world problems. I did have one question: would it be possible to combine the text-based LLM with image-based data (medical imagery) to create a vision-language model (VLM) for some (not all) of the conditions a patient could have? Looking forward to seeing how this project progresses.
Hi Anav! Thank you so much for your question! It's a really interesting point I didn't consider earlier. While I don't think I will be able to do this within the scope of my project over the next few months, I definitely think it could be a possibility, and it's something we should explore in the future as we continue to implement and refine these types of models.
I think the goal of this project is really important. As we saw during the pandemic, the healthcare system plays a crucial role in emergency response, and the burnout experienced by healthcare workers can have major implications since their work is both constant and essential. I’m curious about how you plan to test the system. Will you be using real-world data, or will you simulate patients to evaluate how the system performs? I’m also wondering about the ethical considerations of using an LLM, particularly regarding the privacy and protection of patient information. Overall, I’m really interested to see how this project develops!
Hi Archita! Thank you so much, I totally agree. To test my system, I plan to use both methods. I will set aside a portion of my data that will not be involved in the training of the model, and also find simulated patient cases online. Fortunately, there are not any privacy concerns around patient information, as the entire dataset I am using is de-identified and compliant with all federal guidelines.