Week 9: Clinicogenomic datasets
May 26, 2026
This week I read a lot of literature to inform my project. Below is my interpretation of the three most relevant papers. I stepped back from the newest extraction work to read the older foundations my pipeline depends on, since the hard parts of turning notes into labels were mostly named long before large language models arrived.
The first was Harkema et al. in the Journal of Biomedical Informatics, which introduced ConText, an algorithm for deciding whether a condition in a clinical note is negated, hypothetical, historical, or about someone other than the patient. This changed how I understand where my extraction errors come from. I had been thinking of the task as finding immune-related adverse events (irAEs) in the text, and ConText splits that into two genuinely different problems: noticing that a condition is mentioned, and judging whether it is actually asserted of this patient now. Seeing those as separate made me realize that a pipeline can look accurate at finding mentions while being wrong about assertion in a way that would inflate my case counts, and that the second problem is the one that actually threatens my GWAS.
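To make the mention-versus-assertion split concrete for myself, here is a minimal sketch in the spirit of ConText, not the published algorithm. It assumes a tiny hand-written trigger list (the real lexicon is much larger), only looks at triggers that appear before the condition in the same sentence, and cuts the scope at words like "but". The function names `context_flags` and `is_asserted` and all of the trigger phrases are my own placeholders.

```python
import re

# Illustrative trigger phrases only (assumption); not the ConText lexicon.
TRIGGERS = {
    "negated": ["no evidence of", "denies", "without", "no"],
    "hypothetical": ["if", "should"],
    "historical": ["history of", "previous"],
    "other_experiencer": ["mother", "father", "sister", "brother"],
}
# Words that end the scope of an earlier trigger (also illustrative).
TERMINATORS = ["but", "however", "although"]


def _find_all(phrase, text):
    """Whole-word matches of phrase in text."""
    return list(re.finditer(r"\b" + re.escape(phrase) + r"\b", text))


def context_flags(sentence, condition):
    """Return None if the condition is not mentioned, else the set of
    context flags whose trigger precedes it within scope."""
    text = sentence.lower()
    pos = text.find(condition.lower())
    if pos < 0:
        return None  # problem 1: the condition is not even mentioned
    left = text[:pos]
    # Scope starts after the last terminator that precedes the condition.
    start = 0
    for term in TERMINATORS:
        for match in _find_all(term, left):
            start = max(start, match.end())
    window = left[start:]
    flags = set()
    for flag, phrases in TRIGGERS.items():
        if any(_find_all(p, window) for p in phrases):
            flags.add(flag)
    return flags


def is_asserted(sentence, condition):
    """Problem 2: mentioned AND carrying no context flag."""
    flags = context_flags(sentence, condition)
    return flags is not None and not flags


note = "Patient denies diarrhea but developed colitis after cycle 2."
print(context_flags(note, "diarrhea"))  # negated mention
print(is_asserted(note, "colitis"))     # asserted of this patient
```

A mention finder alone would count both diarrhea and colitis here. Only the second function separates the two, which is the part that decides whether a patient ends up as a case.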
The second was the Snorkel paper by Ratner et al. on weak supervision, the idea of writing many noisy labeling functions and modeling their agreement to produce probabilistic training labels. This reframed how I think about my LLM’s output. I had been treating the model’s output as a label I either trust or do not, and weak supervision recasts it as one noisy voter among several, no different in kind from an ICD code or a medication-discontinuation signal. The deeper shift is in how I think about a label itself: it stops being a hard fact about a patient and becomes a probability with disagreement baked into it, which is a more honest representation of how uncertain note-derived phenotypes really are.
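The "one noisy voter among several" idea can be sketched in a few lines. This is not Snorkel's label model, which learns each labeling function's accuracy from agreement patterns. Here I hand-set the accuracies and combine votes with a naive Bayes style log-odds sum that assumes the voters are independent. The patient field names, the accuracy values, and the function names are all hypothetical.

```python
import math

ABSTAIN, NEG, POS = -1, 0, 1


# Each labeling function votes POS, NEG, or ABSTAIN for one patient record.
# The dictionary keys are made-up field names (assumption).
def lf_icd(patient):
    return POS if patient.get("icd_irae_code") else ABSTAIN


def lf_discontinued(patient):
    return POS if patient.get("drug_discontinued") else ABSTAIN


def lf_llm(patient):
    verdict = patient.get("llm_says_irae")
    if verdict is None:
        return ABSTAIN
    return POS if verdict else NEG


# (function, assumed accuracy). These numbers are guesses for illustration;
# Snorkel would estimate them from how the functions agree and disagree.
LABELING_FUNCTIONS = [(lf_icd, 0.85), (lf_discontinued, 0.65), (lf_llm, 0.80)]


def prob_case(patient, prior=0.5):
    """Probability the patient is a case, treating votes as independent."""
    log_odds = math.log(prior / (1 - prior))
    for lf, accuracy in LABELING_FUNCTIONS:
        vote = lf(patient)
        if vote == ABSTAIN:
            continue  # an abstaining voter adds no evidence
        weight = math.log(accuracy / (1 - accuracy))
        log_odds += weight if vote == POS else -weight
    return 1 / (1 + math.exp(-log_odds))


agree = {"icd_irae_code": True, "llm_says_irae": True}
conflict = {"icd_irae_code": True, "llm_says_irae": False}
print(round(prob_case(agree), 3))
print(round(prob_case(conflict), 3))
```

When the ICD code and the LLM agree, the probability moves well above either voter alone. When they conflict, it lands near the middle, and the label carries that disagreement instead of hiding it behind a hard yes or no.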
The third was Zehir et al. in Nature Medicine, the foundational MSK-IMPACT paper, describing prospective sequencing of more than 10,000 patients with matched tumor and normal samples. Reading where my dataset came from changed how I think about its limits. The matched normal is the reason germline analysis is possible at all, and understanding that the panel was built to find somatic tumor mutations across a few hundred cancer genes reframes my germline scan as borrowing a tool designed for something else. It also makes the cohort feel less like a neutral sample of patients and more like a specific population, enriched for advanced and previously treated cancers. That shapes how much I should generalize anything I find and reminds me the dataset was never assembled with my question in mind.
