Week 4: Creating a List of Algorithms to Test
March 27, 2026
This week, I worked on narrowing down which compression and optimization techniques to test.
The first technique I plan to try is structured pruning with delayed masking, from a 2024 paper by Frensel, Al-Ars, and Hofstee that I covered in my literature review. The idea is to train the model to eliminate entire neurons that contribute little, using a sparsity penalty that is introduced gradually, so the model has time to learn before pruning begins. They reported a 21x reduction in model size with almost no accuracy drop. However, everything in their paper was tested on GPUs, not on low-power CPUs, which is exactly the gap I am trying to fill.
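To make the delayed-masking idea concrete, here is a minimal sketch of the two pieces: a penalty coefficient that stays at zero for a warm-up period and then ramps in, and a mask that drops whole neurons whose incoming-weight norm has collapsed. This is my own illustration, not code from the paper, and the schedule parameters (`delay`, `ramp`, `target`) and the threshold are placeholder values.

```python
import math

def sparsity_weight(epoch, delay=10, ramp=20, target=1e-4):
    # Penalty is zero for `delay` epochs, then ramps linearly
    # up to `target` over the next `ramp` epochs.
    if epoch < delay:
        return 0.0
    return target * min(1.0, (epoch - delay) / ramp)

def group_l2(neuron_weights):
    # L2 norm of one neuron's incoming weights -- the "group"
    # that structured pruning keeps or removes as a unit.
    return math.sqrt(sum(w * w for w in neuron_weights))

def prune_mask(layer, threshold=1e-3):
    # True = keep the neuron, False = prune it entirely.
    return [group_l2(n) >= threshold for n in layer]
```

The delayed ramp is what gives the model time to learn before the penalty starts pushing unimportant neurons toward zero; only once their group norms have actually collapsed does the mask remove them.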
I also looked into quantization, which reduces the precision of the weights from 32-bit floats down to 8-bit integers. Dorado already does this internally, and it is well supported on ARM hardware like the Raspberry Pi. I want to test post-training quantization first because it is fast, and then try quantization-aware training if accuracy drops too much. Another technique I plan to test is knowledge distillation: training a small student model to mimic the outputs of a larger teacher model, which tends to preserve more accuracy than compressing the large model directly.
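As a rough illustration of both ideas (my own toy sketch, not Dorado's implementation), the snippet below does a symmetric per-tensor int8 round trip and computes the soft-target distillation loss, i.e. the cross-entropy between temperature-softened teacher and student distributions. The temperature value is a placeholder.

```python
import math

def quantize_int8(weights):
    # Symmetric per-tensor scale: map the largest magnitude to 127.
    scale = (max(abs(w) for w in weights) / 127) or 1.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

def softmax_t(logits, temp):
    # Temperature-softened softmax: higher temp flattens the distribution,
    # exposing the teacher's "dark knowledge" about non-top classes.
    exps = [math.exp(z / temp) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(student_logits, teacher_logits, temp=4.0):
    # Cross-entropy between the softened teacher (target) and student.
    p = softmax_t(teacher_logits, temp)
    q = softmax_t(student_logits, temp)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))
```

The quantization round trip shows why post-training quantization is cheap to try: the maximum reconstruction error is bounded by half the scale, and only if that error hurts basecalling accuracy would I move on to quantization-aware training.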
Beyond compression, I also looked at architecture changes that could make the model faster from the start. The most promising one is replacing standard convolutions with depthwise separable convolutions, which can cut parameter count in the convolutional layers by about 9x. This has been shown to work well for time-series and audio tasks, but has not been tried for nanopore basecalling. Another option is replacing the LSTM entirely with a temporal convolutional network, which uses dilated convolutions to capture long-range context without the sequential bottleneck that makes LSTMs slow on CPUs.
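Some back-of-the-envelope numbers for both architecture changes (my own sketch; the channel and kernel sizes below are placeholders, not values from any particular basecaller): parameter counts for a standard versus depthwise separable 1D convolution, and the receptive field of a dilated-convolution (TCN) stack.

```python
def conv1d_params(c_in, c_out, k):
    # Standard 1D convolution: every output channel sees every input channel.
    return c_in * c_out * k

def separable_conv1d_params(c_in, c_out, k):
    # Depthwise (one k-tap filter per input channel) plus pointwise 1x1 mixing.
    return c_in * k + c_in * c_out

def tcn_receptive_field(k, n_layers):
    # Dilations double each layer (1, 2, 4, ...), so context grows
    # exponentially with depth instead of requiring recurrence.
    return 1 + sum((k - 1) * 2 ** i for i in range(n_layers))

# With 256 channels and kernel width 9, the separable version is ~8.7x
# smaller; the ratio approaches k as the channel count grows.
ratio = conv1d_params(256, 256, 9) / separable_conv1d_params(256, 256, 9)
```

For the TCN side, an eight-layer stack with kernel width 3 already covers 511 samples of context with no recurrence at all, which is what removes the sequential bottleneck that makes LSTMs slow on CPUs.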
Comments

Hi Charan,
Nice work! I am really excited to see how your project goes! I did have one question though. How well does your structured pruning approach translate to low-power CPU environments? Since your experiments were conducted on GPUs, I’m curious whether the sparsity patterns you learned (especially with delayed masking) actually result in measurable speedups on CPUs, or if the gains are mostly in model size rather than inference latency.