Week 3: Data Exploration
March 26, 2026
This post covers the work I have done over the past three weeks: exploring the dataset, benchmarking existing basecallers, and building baseline models.
The first thing I did was prepare my dataset. I'm using NanoBaseLib, a benchmark dataset for nanopore sequencing tasks. After downloading and extracting it, the raw signal data is stored in a multi-read FAST5 file. The signals are 16-bit integers representing raw ionic current samples, so I wrote code to load them and convert them to picoamperes using the calibration metadata stored alongside each read. The converted signals ranged from roughly 53 to 141 pA, with a median read length of about 35,000 samples.

What stood out right away was how noisy the data is. At a coarse scale the signal has a clear step-like structure, where each step corresponds to a different sequence of bases passing through the pore. Within each step, however, there is substantial noise, around 3 to 4 pA, which makes the boundaries between steps hard to pin down. Even in the zoomed-in graphs, the transitions are not clean.
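The conversion itself is simple once the calibration attributes are in hand. Here is a minimal sketch of the standard ONT formula, pA = (raw + offset) × range / digitisation; the attribute names (`offset`, `range`, `digitisation`) follow the usual FAST5 `channel_id` metadata, and the example values below are made up for illustration:

```python
import numpy as np

def raw_to_pA(raw, offset, rng, digitisation):
    """Convert ONT raw 16-bit samples to picoamperes.

    Standard ONT calibration: pA = (raw + offset) * range / digitisation.
    In a multi-read FAST5 these three values are stored as attributes of
    each read's channel_id group, and the raw samples under Raw/Signal
    (readable with e.g. h5py or ont_fast5_api).
    """
    return (np.asarray(raw, dtype=np.float64) + offset) * (rng / digitisation)

# Hypothetical calibration values, just to show the shape of the call:
signal_pA = raw_to_pA([0, 2048], offset=10, rng=1400.0, digitisation=8192)
```

The key point is that raw integers from different reads are not directly comparable; only after applying each read's own calibration do the currents land on a common pA scale.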
I also looked at the eventalign output from Nanopolish, which is the pre-computed alignment of signal segments to reference k-mers. This is the ground truth that basecallers try to learn. Looking at the distribution of event-level mean currents and event durations gave me a better sense of how much variation there is between reads. Some events are very short while others are long, and there is a wide range of current levels across different k-mers.
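To get those distributions I summarized the eventalign table with pandas. This is a sketch assuming Nanopolish's usual column names (`event_level_mean` in pA, `event_length` in seconds, `reference_kmer`); if your Nanopolish version names them differently, adjust accordingly:

```python
import pandas as pd

def summarize_eventalign(df):
    """Summarize per-event current levels and durations from a
    Nanopolish eventalign table (one row per aligned event).
    """
    return {
        # Spread of event-level mean currents across all events
        "mean_current_pA": df["event_level_mean"].describe(),
        # Spread of event durations (short vs. long dwells)
        "duration_s": df["event_length"].describe(),
        # How far apart the average current levels of different k-mers sit
        "kmer_level_range": (
            df.groupby("reference_kmer")["event_level_mean"]
              .mean()
              .agg(["min", "max"])
        ),
    }

# Typical usage (path is hypothetical):
# df = pd.read_csv("eventalign.tsv", sep="\t")
# stats = summarize_eventalign(df)
```

Grouping by k-mer is what makes the current-level spread visible: each k-mer has its own characteristic level, and the min-to-max range across k-mers shows how much of the signal variation is sequence-driven rather than noise.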
For week two, I focused on benchmarking existing basecallers to give myself a target to compare against. The main one I tested was Dorado, Oxford Nanopore's current production basecaller. I ran Dorado version 0.9.6 on the demo dataset using Apple Metal acceleration (since I am working on a Mac), and it finished in about 285 seconds. Dorado produced results for 4,574 reads with a mean Q-score of 34.5, which corresponds to an estimated accuracy of around 99.96%. Guppy, the older ONT basecaller, had a lower mean Q-score of 18.7, corresponding to about 98.7% accuracy.
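The accuracy figures follow directly from the Phred definition of the Q-score, Q = -10 · log₁₀(P_error), so accuracy = 1 − 10^(−Q/10). A one-liner makes the conversion explicit:

```python
def qscore_to_accuracy(q):
    """Estimated per-base accuracy from a Phred-style Q-score.

    Q = -10 * log10(P_error)  =>  accuracy = 1 - 10 ** (-Q / 10)
    e.g. Q34.5 -> ~99.96% accurate, Q18.7 -> ~98.7% accurate.
    """
    return 1.0 - 10.0 ** (-q / 10.0)
```

One caveat: a mean Q-score is an aggregate estimate of accuracy, not a direct measurement against a reference, so it should be read as "around" rather than an exact alignment-based identity.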
In week three, I built four baseline models, all trained with CTC loss, which is the standard approach for basecalling. CTC stands for Connectionist Temporal Classification; it is used here because the input signal and the output base sequence have different lengths, and we do not know which signal samples correspond to which bases. CTC handles the alignment implicitly during training, which makes it a natural fit for this problem. The first model was a simple fully connected neural network (ANN). It converged to a CTC loss of around 3.9 and essentially predicted the same base for every input, a common degenerate solution for a model that cannot use temporal context.
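Concretely, CTC sums the probability of every frame-level alignment that collapses to the target sequence, using a forward (dynamic-programming) recursion over the target interleaved with blanks. Here is a minimal sketch of that forward pass; for clarity it works in probability space, whereas production implementations (e.g. PyTorch's `nn.CTCLoss`) work in log space for numerical stability:

```python
import numpy as np

def ctc_loss(probs, target, blank=0):
    """Negative log-likelihood of `target` under CTC.

    probs:  (T, C) array of per-frame class probabilities.
    target: list of label indices (non-blank), assumed non-empty.
    """
    # Interleave blanks around the target: [b, t1, b, t2, ..., b]
    ext = [blank]
    for lbl in target:
        ext += [lbl, blank]
    S, T = len(ext), len(probs)

    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, ext[0]]          # start on leading blank...
    alpha[0, 1] = probs[0, ext[1]]          # ...or on the first label
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]             # stay on the same state
            if s > 0:
                a += alpha[t - 1, s - 1]    # advance one state
            # Skip over a blank only between two *different* labels
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * probs[t, ext[s]]
    # Valid paths end on the last label or the trailing blank
    return -np.log(alpha[T - 1, S - 1] + alpha[T - 1, S - 2])
```

For example, with two frames of uniform probabilities over {blank, A} and target "A", the three valid alignments (blank·A, A·blank, A·A) each have probability 0.25, so the loss is −ln(0.75).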
The second model was a 1D CNN with four convolutional layers and a kernel size of 11; each layer sees 11 neighboring samples, so the stacked network has a receptive field of about 41 samples per prediction. This was significantly better, finishing with a validation loss of around 1.63. The third was a bidirectional LSTM, which processes the signal in both directions and can capture longer-range dependencies. It reached a similar validation loss to the CNN, around 1.63. The fourth model was a CNN+LSTM hybrid: a convolutional front-end extracts local features from the signal, and the LSTM processes the resulting feature sequence to capture longer context. This performed best of the four, ending at a loss of about 1.27 on the training set and 1.28 on validation. It also had the smallest parameter count of the non-ANN models, at around 56,000 parameters (220 KB), compared to the CNN and LSTM, which were both over 500 KB.
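The receptive-field figure comes from a standard calculation: each stacked convolution adds (kernel − 1) × jump samples of context, where the jump is the product of the strides of the layers below it. A small helper makes this checkable (stride 1 and dilation 1 assumed, matching my CNN):

```python
def receptive_field(kernel_sizes, strides=None):
    """Receptive field of stacked 1D convolutions (dilation 1).

    Each layer widens the field by (kernel - 1) * jump input samples,
    where jump is the cumulative stride of all layers beneath it.
    """
    strides = strides or [1] * len(kernel_sizes)
    rf, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump
        jump *= s
    return rf

# Four stride-1 layers with kernel size 11 -> 41-sample receptive field
print(receptive_field([11, 11, 11, 11]))
```

This is also why the CNN+LSTM hybrid wins despite having fewer parameters: the convolutions only need to cover local structure, and the LSTM supplies the long-range context that would otherwise require many more (or much wider) convolutional layers.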
