Updates to my genome scaffolding method
Last Updated: 2025-01-26
Outline
Following the genomes published in 2025 (link), I sequenced new hornwort genomes using either Oxford Nanopore R10 or PacBio HiFi reads. Together with the hifiasm update that allows it to assemble R10 ONT data, this produced much higher quality initial assemblies, including several T2T contigs. Hi-C scaffolding by Phase Genomics resulted in most chromosomes being composed of just 2-3 large contigs.
However, comparing the structural collinearity of male vs. female individuals within species, including some plants that were siblings from the same parent, revealed large rearrangements. While true chromosome translocations are possible, these rearrangements seemed more likely to be artifacts of Hi-C scaffolding. I wanted to assess the accuracy of all the scaffolded genomes, assuming the newer ones were more likely to be correct. To do so, I came up with this procedure, starting with the higher quality R10/HiFi genomes:
- Annotate telomere sequences in each scaffolded genome and its contig-level assembly.
- Locate telomeres within the scaffolded genome.
- If telomeres are not at scaffold ends, inspect the Hi-C heatmap for contigs that can be moved to place telomeres at ends without contradicting the Hi-C data.
- If some telomeres are missing from scaffolds, check the contig-level assembly to see if they were trimmed by Phase Genomics.
- If telomeres were trimmed during scaffolding, rescaffold the whole genome with the untrimmed contigs from the contig-level assembly, then reinspect the Hi-C heatmap for conflicts.
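The telomere-location step above can be sketched in a few lines of Python. This is only a minimal illustration, not the annotation tool used for these genomes: the repeat unit `TTTAGGG` (the common plant-type motif), the `min_repeats` threshold, and the `end_window` size are all assumptions that should be adjusted for the species and assembly at hand, and the function names are hypothetical.

```python
import re

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_telomere_arrays(seq, motif="TTTAGGG", min_repeats=5):
    """Find runs of at least `min_repeats` perfect tandem copies of `motif`.

    Both the motif and its reverse complement are searched.
    Returns a sorted list of (start, end, strand) tuples (0-based, end-exclusive).
    The default motif is an assumption, not taken from the original protocol.
    """
    seq = seq.upper()
    hits = []
    for strand, unit in (("+", motif), ("-", reverse_complement(motif))):
        pattern = re.compile("(?:%s){%d,}" % (re.escape(unit), min_repeats))
        for match in pattern.finditer(seq):
            hits.append((match.start(), match.end(), strand))
    return sorted(hits)


def classify_hit(start, end, seq_len, end_window=1000):
    """Label a repeat array by where it sits on the sequence.

    Arrays starting within `end_window` bp of the start are 'left_end',
    arrays ending within `end_window` bp of the end are 'right_end',
    and everything else is 'internal' (a candidate for Hi-C inspection).
    """
    if start < end_window:
        return "left_end"
    if end > seq_len - end_window:
        return "right_end"
    return "internal"


def summarize_telomeres(seq, motif="TTTAGGG", min_repeats=5, end_window=1000):
    """Return a list of (start, end, strand, label) for one scaffold or contig."""
    return [
        (start, end, strand, classify_hit(start, end, len(seq), end_window))
        for start, end, strand in find_telomere_arrays(seq, motif, min_repeats)
    ]
```

Running `summarize_telomeres` on both the scaffolds and the contig-level assembly gives a quick list of which sequences have terminal arrays and which have internal ones. Real telomere arrays often contain imperfect repeat copies, so a perfect-match regular expression like this one will undercount; it is meant as a first pass before looking at the Hi-C heatmap.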
Using this process, I was able to create T2T scaffolds for most of the high quality genomes, bringing them to the expected n = 5 chromosomes. Then I applied a modified version of the process to the lower quality genomes from the same species as the high quality ones:
- Rescaffold the low quality contig-level assembly using the high quality scaffolds as the reference.
- Check telomere locations and Hi-C heatmap of the rescaffolded low quality genome.
- Break/move contigs as supported by the data.
Iterating this process significantly improved the lower quality genomes, though they are not perfect. Telomeres were not well assembled with ONT R9 data, so most of these assemblies lack enough telomere-containing contigs to construct T2T scaffolds. In some cases, telomere sequence was assembled in the middle of a contig, but there was not enough Hi-C support to call it a misassembly. Despite some remaining uncertainty, the new assemblies are improvements: scaffold numbers match the expected chromosome number, and intraspecific chromosome structure is more consistent.
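One quick sanity check on a rescaffolded assembly is whether the number of chromosome-scale scaffolds matches the expected chromosome number (n = 5 here). The sketch below is a hypothetical helper, not part of the original workflow; the `min_length` cutoff separating chromosome-scale scaffolds from unplaced debris is an assumption that has to be chosen per genome.

```python
def read_fasta_lengths(handle):
    """Return {sequence_name: length} from an open FASTA file handle."""
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def chromosome_scale_scaffolds(lengths, min_length):
    """Return sorted names of sequences at least `min_length` bp long."""
    return sorted(name for name, size in lengths.items() if size >= min_length)


def matches_expected_n(lengths, min_length, expected_n=5):
    """True if the count of chromosome-scale scaffolds equals `expected_n`."""
    return len(chromosome_scale_scaffolds(lengths, min_length)) == expected_n
```

For example, opening a scaffold FASTA with `open(path)` and passing the handle to `read_fasta_lengths` gives the lengths needed for `matches_expected_n`. A matching count does not prove the scaffolds are correct, so this complements rather than replaces the telomere and Hi-C checks.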
Data and Dependencies
Data
- Long-read DNA, >30X coverage
- Short-read Hi-C, ~100X coverage
- Reference genome (for guided rescaffolding)
- Short-read polyA-enriched or ribo-zero RNA, ~7 Gb
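The coverage targets above translate into sequencing yield through the usual relation depth = total bases / genome size. The sketch below shows the arithmetic; the 150 Mb genome size is a placeholder assumption for illustration only and should be replaced with an estimate for the species being sequenced.

```python
def required_bases(genome_size_bp, target_depth):
    """Total sequenced bases needed to reach `target_depth` X coverage."""
    return genome_size_bp * target_depth


def achieved_depth(total_bases, genome_size_bp):
    """Average coverage depth given the total sequenced bases."""
    return total_bases / genome_size_bp


# Hypothetical genome size, for illustration only.
GENOME_SIZE_BP = 150_000_000

long_read_bases = required_bases(GENOME_SIZE_BP, 30)   # long-read target: >30X
hic_bases = required_bases(GENOME_SIZE_BP, 100)        # Hi-C target: ~100X

print("Long reads needed: %.1f Gb" % (long_read_bases / 1e9))
print("Hi-C reads needed: %.1f Gb" % (hic_bases / 1e9))
```

Under this assumed genome size, the long-read target works out to 4.5 Gb and the Hi-C target to 15 Gb of sequence.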