Eremid Genomics and Catawba College -- Illumina Next Generation Sequencing Workshop
Useful Links
- Geneious Prime Free Trial
- FastQC
- Sequence Data Files (Box Drive)
- Windows Subsystem for Linux
- GeSeq
- BV-BRC Taxonomic Classification Service
- CIPRES Science Gateway
- Galaxy Webserver
- Miniconda Install
Unix/Linux Command Line Plastome Assembly
Software
Install
Install conda if you haven’t already. Following are instructions for Linux and Mac – pick correct one for your computer. Follow the on-screen prompts using default settings.
# MacOS ARM CPU
mkdir -p ~/miniconda3
curl https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-arm64.sh -o ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm -rf ~/miniconda3/miniconda.sh
# MacOS Intel CPU
curl https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh -o ~/miniconda3/miniconda.sh
# Linux
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm -rf ~/miniconda3/miniconda.sh
Intialize
~/miniconda3/bin/conda init bash
~/miniconda3/bin/conda init zsh
Restart your terminal session for it to take effect.
conda install -c bioconda -c conda-forge fastp spades getorganelle novoplasty iqtree matplotlib
Inspect Data
Trim adapters, discard low quality reads with fastp.
fastp -i R1.fastq.gz -I R2.fastq.gz -o R1.fastp.fastq.gz -O R2.fastp.fastq.gz -5 -3 --detect_adapter_for_pe -p --html SAMPLE.html --json SAMPLE.json
Inspect fastp html file, check for any abnormalities.
(Optional) Taxonomic classification to check for bacterial contaminants. https://www.bv-brc.org/app/TaxonomicClassification
Method 1: Reference-based filtering followed by de novo assembly
Get a complete reference sequence (FASTA format) from NCBI GenBank or other source.
Index reference sequence with Bowtie2.
bowtie2-build reference.fasta reference.fasta
Map reads to reference, saving mapped reads to new files. Adjust -p
to match your computer’s # CPUs. Using --very-sensitive-local
enhances short matches for more distantly related species.
bowtie2 -p 12 -x reference.fasta -1 R1.fastp.fastq.gz -2 R2.fastp.fastq.gz --al-conc-gz R%.aln.fastq.gz --very-sensitive-local &> /dev/null
Assemble mapped reads with SPAdes. Change -t
to match your computers # of CPUs. Change -o
to whatever the sample name is.
spades.py -1 R1.aln.fastq.gz -2 R2.aln.fastq.gz -t 12 -o SAMPLE_NAME --careful
Check the scaffolds.fasta
file in the output folder. If successful, there should be three large scaffolds, approx. 100k, 25k, and 15k in length, representing the LSC, SSC, and IR regions. Manually stitch them together in Geneious based on the reference.
If more and/or smaller scaffolds are produced, try mapping them to the reference in Geneious to see what’s missing.
Method 2: Completely de novo with getOrganelle
Input all adapter-trimmed reads into getOrganelle. To set up getOrganelle for the first time:
# This downloads all databases for different organisms and organelles. See help menu/documentation to download certain ones.
get_organelle_config.py -a all
Run getOrganelle for an embryophyte plant plastid. See documention for other organism and organelle options.
get_organelle_from_reads.py -1 R1.fastp.fastq.gz -2 R2.fastp.fastq.gz -F embplant_pt -t 12 -o output_dir
Inspect output files. Hopefully you’ll get a complete circular sequence with filename like: embplant_pt.K115.complete.graph1.1.path_sequence.fasta
.
Remap reads to assembled plastome to evaluate.
evaluate_assembly_using_mapping.py -f embplant_pt.K115.complete.graph1.1.path_sequence.fasta -1 R1.fastp.fastq.gz -2 R2.fastp.fastq.gz -t 12 -o evaluate_output --draw
Check coverage levels plotted in mapping.pdf
. “Matched” level should be pretty even (though noisy) across LSC/SSC, maybe be different in IRs.
Method 3: Completely de novo with NOVOPlasty.
NOVOPlasty requires a starting sequence (FASTA format) that closely matches your organism. Find a sequence for a conserved gene (e.g. rbcL) from GenBank or another source. Copy the file into your working directory.
//TODO