Week 3: Look What You Made Me Do
March 24, 2023
Dear reader… if you’re still here, I thank you for fearlessly fending off the urge to run away and never look at this blog again. To be honest, it’s nice to have a friend through this roller coaster ride of a project. Speaking of which, let’s get into this week’s updates!
Quick Recap
As I discussed last week, I’ve unfortunately been consistently running out of RAM during the stain deconvolution process. After watching my tears ricochet as I kept trying (and failing) to solve this issue, I was left with 3 options:
- Try running stain deconvolution on a smaller glioblastoma slide from TCGA (my dataset), since the memory needed to process a slide scales with its size.
- Separate whole slides into several smaller images (called “patches”) through patch extraction techniques and run stain deconvolution on individual patches to significantly decrease memory usage.
- Implement raw stain deconvolution algorithms from scratch (which requires a lot of matrix math, so I’m not too keen about this option but it’s still a last resort in case the above options aren’t successful).
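For a sense of what that matrix math actually looks like, here's a minimal from-scratch sketch of Ruifrok–Johnston color deconvolution in NumPy. The stain vectors are the standard published H&E values (real slides may need slide-specific vectors), and the one-pixel "patch" at the end is just a synthetic stand-in:

```python
import numpy as np

# Minimal Ruifrok-Johnston color deconvolution, from scratch.
# Rows are unit-length optical-density "fingerprints" for hematoxylin,
# eosin, and a residual channel (standard published H&E values).
stain_matrix = np.array([
    [0.65, 0.70, 0.29],   # hematoxylin
    [0.07, 0.99, 0.11],   # eosin
    [0.27, 0.57, 0.78],   # residual
])
stain_matrix /= np.linalg.norm(stain_matrix, axis=1, keepdims=True)

def deconvolve(rgb):
    """Split an (H, W, 3) RGB image into per-stain concentration maps."""
    # Beer-Lambert: convert transmitted light to optical density,
    # then unmix the stains with one matrix inverse.
    od = -np.log((np.asarray(rgb, dtype=float) + 1.0) / 256.0)
    concentrations = od.reshape(-1, 3) @ np.linalg.inv(stain_matrix)
    return concentrations.reshape(np.shape(rgb))

# A synthetic pixel "dyed" with pure hematoxylin: its concentration
# should land almost entirely in the first output channel.
pure_h = 256.0 * np.exp(-stain_matrix[0]) - 1.0
maps = deconvolve(pure_h.reshape(1, 1, 3))
print(maps.shape)  # (1, 1, 3)
```

This is the core idea libraries like HistomicsTK implement (with far more care around stain estimation and numerical stability), which is exactly why writing it all myself is the last resort.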
Now, let’s discuss my progress in determining which option is the best route to take for my project.
Option 1: Using a Smaller Slide
Initially, I was using a slide about 300 MB in size to test the stain deconvolution algorithm. For some context, most slides tend to be around 700 MB to 1.5 GB, so this slide was already on the smaller side. Here’s a picture of the slide I was using:
However, as I discussed in my last blog post, I ran out of RAM while processing this slide. I even tried switching to a high-RAM notebook with the best GPU Google Colab offers, to no avail. And then I thought: well, I’m gonna try this on the smallest slide in the entire dataset. And so I did. After going through all of the available slides in TCGA’s glioblastoma dataset, the smallest slide I found was 6.46 MB, which is…very small. Here’s the slide:
I went into the stain deconvolution process yet again, fully expecting there to be no memory issues given the slide’s size. But long story short, just minutes later, guess what I saw on my screen?
So it goes…
The happiness I was experiencing upon finding this very small slide and anticipating positive results had vanished. At this point, I was honestly a bit frustrated and had no other course of action but to stop what I was doing, switch gears, and try my second option. It was time to begin again.
Option 2: Patch-Level Stain Deconvolution
The biggest advantage I have with patch-level stain deconvolution is that I’m processing much smaller (256 x 256 pixel) images as opposed to relatively larger whole slide images. Because of this, I went in with the hope that my memory issues would largely be alleviated. However, it’s important to bear in mind that this is in no way a perfect solution: as my external advisor mentioned while I was discussing this option with her, there is a chance that I could face the formation of artifacts when stitching the processed patches back together into the original image. This could very likely mess with the results of experiments that use slides passed through my stain deconvolution algorithm.
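To put rough numbers on that advantage, here's a quick back-of-the-envelope comparison (the slide dimensions are assumed for illustration, not measured from my dataset):

```python
# Decoded memory for a hypothetical whole slide at full magnification
# vs. a single 256 x 256 patch, both as 8-bit RGB.
bytes_per_pixel = 3

slide_px = 80_000 * 60_000   # assumed level-0 slide dimensions
patch_px = 256 * 256

slide_gb = slide_px * bytes_per_pixel / 1e9
patch_kb = patch_px * bytes_per_pixel / 1e3

print(f"whole slide: ~{slide_gb:.1f} GB decoded")  # ~14.4 GB
print(f"one patch:   ~{patch_kb:.0f} KB decoded")  # ~197 KB
```

Even before any processing overhead, that's a difference of several orders of magnitude, which is why I expected the memory issues to (mostly) go away.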
With my disclaimer out of the way, let’s move on to my implementation of patch-level stain deconvolution and my evaluation of the results. Here are the steps I followed:
- Clone the CLAM GitHub repository into my Colab notebook and install the openslide-tools library, which is required for CLAM to function properly. CLAM is a specialized whole slide image processing toolkit developed by Harvard researchers. I’ve used it before in my work with the researchers who created CLAM, so I already have some prior experience that proved incredibly useful while working on this project. In other words, this toolkit is nothing new (ok I admit that was a bit forced but it still fits) to me.
- Define my settings for patch generation and execute patch extraction from my chosen slide. Here, a patch level of 0 means I’m using the highest magnification level available in the slide (since SVS files are structured as a pyramid of successively decreasing magnification levels). A patch size of 256 produces patches 256 pixels wide and 256 pixels tall, and a step size of 256 (equal to the patch size) means adjacent patches don’t overlap, so no two patches contain the same image features. Finally, the last three settings are directories the patch extraction algorithm uses to source and store the necessary data.
- Visualize a random patch that was just extracted from the slide. CLAM is unique in that it stores patches as slide image coordinates in the HDF5 file format instead of storing the image patches themselves. This saves a LOT of storage space, as I’m essentially storing numbers instead of images.
As demonstrated below, I first wrote code to view some sample coordinates stored in the HDF5 file. Then, I picked a set of coordinates, cropped the section of the original slide corresponding to those coordinates, and visualized the resulting image. During this process, I also had to convert the raw SVS slide into the TIFF format since SVS is a proprietary imaging format and faces compatibility issues with many libraries, making the slide untouchable in its original format.
Also note that on the second image, x and y are the coordinates themselves and h and w are the height and width of the patch (256 pixels for both).
- Perform stain deconvolution on my selected patch. Spoiler alert: I (at long last) didn’t run into RAM issues!! Here are my results:
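The cropping step above can be sketched like this. In the real pipeline the coordinates come out of CLAM's HDF5 output (read with h5py) and the pixels come from the slide via OpenSlide; here a random NumPy array stands in for the slide, and the coordinates are made up:

```python
import numpy as np

# Sketch of patch cropping from CLAM-style (x, y) coordinates.
# A random array stands in for the (converted) slide image; in
# practice the coordinates are read from CLAM's HDF5 file and the
# pixels from the slide itself.
rng = np.random.default_rng(0)
slide = rng.integers(0, 256, size=(1024, 1024, 3), dtype=np.uint8)

coords = [(0, 0), (256, 256), (512, 0)]   # (x, y) top-left corners
patch_size = 256

patches = [slide[y:y + patch_size, x:x + patch_size] for x, y in coords]
print(len(patches), patches[0].shape)  # 3 (256, 256, 3)
```

Note how little the coordinates themselves cost to store: three pairs of integers versus three full 256 × 256 × 3 images, which is exactly the storage win I mentioned above.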
Although the resulting images are a bit blurrier than I had expected, patch-level stain deconvolution was a success! Look what you made me do, incredibly annoying memory issues :))
While analyzing these images, I noticed that they contained weird lines and random changes in clarity (most visible in the center area of the hematoxylin image). I’ll have to take a closer look at that in the future to figure out what exactly is causing those areas to look the way they do, especially since as far as I can tell, they aren’t present on the raw patch image.
After discussing these results with my external advisor, I have a few next steps that I’ll be undertaking in the coming week:
- Change libraries and try stain deconvolution with HistomicsUI (an alternative to HistomicsTK, the tool I’m currently using) to see whether it yields better results (images with less blurriness and without the artifacts I described above)
- Select a new slide with pen markings and attempt to run stain deconvolution on two patches from that slide, where one patch exhibits the pen markings and the other doesn’t, and compare my results for both patches
- Apply stain deconvolution to a variety of patches from other, larger slides, since minimizing slide size is no longer a significant issue now that I’m working at the patch level instead of the whole slide level.
That’s it for now–time to go bask in the afterglow of this week’s successes. Thanks for sticking around, and see you next week :))
Citations
- “The Cancer Genome Atlas.” Genomic Data Commons, National Cancer Institute, https://portal.gdc.cancer.gov/.
- Lu, M.Y., Williamson, D.F.K., Chen, T.Y. et al. Data-efficient and weakly supervised computational pathology on whole-slide images. Nat Biomed Eng 5, 555–570 (2021). https://doi.org/10.1038/s41551-020-00682-w
- “Christoon Cartoons.” Christoon Cartoons, https://christoon-cartoons.carrd.co/. (Twitter: @ChristoonC)