Spatiotemporal attention boosts calling of complicated variations from long reads’ alignment data

Publisher: Association for Computing Machinery (ACM)
Publication Type: Conference Proceeding
Citation: ACM-BCB 2024 - 15th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, 2024, pp. 1-10
Issue Date: 2024-12-16
With the latest Q20 technology, the base error rate in ONT long-read sequencing has dropped to 1%. However, this rate of sequencing errors (base insertions, deletions, or substitutions) still lags behind the 0.1% base error rate of NGS short reads, resulting in many complicated variation regions in the full alignment data of deeply sequenced long reads and posing a major challenge to germline variant calling. For example, current deep learning methods can misidentify 20,000 to 30,000 variants on a single chromosome from ONT long reads basecalled by Q20, or more than 30,000 in the complicated variation regions when the reads are basecalled by Guppy v5.0.14. We propose a spatiotemporal attention deep learning method (Attdeepcaller) to boost the performance of variant calling in these complicated variation regions. The novel use of spatiotemporal attention is to modulate the confusion between genuine sequencing errors and true germline variations in the alignment data, so that the algorithm's identification becomes clear in most cases. As tested on the complicated regions in the alignment data basecalled by Q20 on chr1 of HG002, Attdeepcaller made only 22,739 misidentifications, a 12.69% reduction from the 26,043 misidentifications of the current best method. Similarly, the number of misidentifications is reduced by 16.49% on HG003 and by 23.58% on HG004 compared with the current best. When tested on the Guppy 5 alignment data, Attdeepcaller improved the precision by 3 percent and the recall by 1 percent in the complicated variation regions. We also conducted a comparative analysis of these methods on data from older versions of Guppy. Specifically, on the Guppy v3.4.5 datasets, Attdeepcaller boosted the precision by 16 percent and improved the recall by 10 percent.
This result suggests that Attdeepcaller still works, and works substantially better, when the read alignment data become more complicated (the older the basecaller version, the higher the base error rate of the sequencing data and the more complicated the alignment data).
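
The 12.69% figure quoted for chr1 of HG002 can be checked from the two misidentification counts given above. The following is a minimal sketch of that arithmetic only; the function name `relative_reduction` is a hypothetical helper introduced for illustration and is not part of Attdeepcaller.

```python
def relative_reduction(baseline: int, improved: int) -> float:
    """Return the percentage drop from `baseline` to `improved`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline * 100.0


# Counts quoted in the abstract for the complicated regions on chr1 of HG002
# (Q20 basecalling).
baseline_errors = 26_043        # current best method
attdeepcaller_errors = 22_739   # Attdeepcaller

reduction = relative_reduction(baseline_errors, attdeepcaller_errors)
print(f"{reduction:.2f}%")  # prints 12.69%
```

The same calculation would apply to the HG003 and HG004 figures, but the underlying counts for those samples are not given here, so they are not reproduced.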