Basecalling

Instructors

1) Aims

Overall: How to take Oxford Nanopore (ON) data from raw signal to called variants

In this part of the session we will learn about Data formats , Basecalling and Basic Quality Control(QC)

2) Starting point

img_1

3) Raw reads

All nanopore data is written to a specific run directory. Change to the example directory below…

cd /mnt/albasj/data/nanopore/

… and list it to see what is there

ls 

Lets take a look in the reads/ directory. This is where the sequencer writes raw data.

ls reads/ 

You should see two directories: (0/ and 1/)

ON machines write reads in batches (here batch size is 4000) This is to keep the number of files in each directory resonable for the computers filesystem

ls reads/0/ 

4) FAST5 files

Nanopore writes read data to a file format they call FAST5

1 read = 1 .fast5 file

FAST5 files are infact HDF5 files. These are compressed binary files which store data in a structured way, allowing fast random access to subsets of the data.

This is where electronic signal data from the sequencer is storred

img_2

Lets look at the structure of a FAST5 file using h5ls

h5ls reads/0/GXB01206_20180518_FAH88225_GA50000_sequencing_run_CD3_92236_read_9998_ch_295_strand.fast5

This shows the top level data keys.

We can view the subkeys by recursivley listing the file

h5ls -r reads/0/GXB01206_20180518_FAH88225_GA50000_sequencing_run_CD3_92236_read_9998_ch_295_strand.fast5

We can also dump and view the entire contents of a FAST5

h5dump reads/0/GXB01206_20180518_FAH88225_GA50000_sequencing_run_CD3_92236_read_9998_ch_295_strand.fast5 | less 

(Hint 1: use the mouse wheel to scroll up and down the file) (Hint 2: press q to exit less, a text reading program)

5) Basecalling

This is the process of translating raw electrical signal data from an ON sequencer to DNA sequence. Basecalling is a critical step in the analysis workflow as poor basecalling makes poor sequence data.

base

Many basecallers exist - but for now we will be using Albacore v2.3.3 developed by ON

For basecalling it is important to know which Flow Cell and Library Prep Kit was used. To see all the combinations which Albacore can handle try:

read_fast5_basecaller.py -l 

So lets basecall

username=_yourusername_

read_fast5_basecaller.py --flowcell FLO-MIN106 \
 --kit SQK-PCS108 \
 --input reads/ \
 --recursive \
 --worker_threads 4 \
 --save_path /mnt/albasj/analysis/${username}/basecalled_reads/

(Interesting fact: You have just started up a Machine Learning script. Albacore, alongside almost all current nanopore basecallers have a neural network at their core)

5.i) A comment about basecallers

As previously mentioned many basecallers are available.

The main performance marker of a basecaller that we care about is the overall Assembly Identity (how much a final alignment matches the reference)

img_3

We can also take a look at the assembly length bias which tells us if a given basecaller is prone to reference insertions or deletions

img_4

(Further reading: A great comparison of basecallers exists here: basecaller_comparison

6) Recommended workflow

There are two different workflows we recommend based on which sequencer you have available.

img_5

7) Basecalling results

Lets take a look inside the new basecalled_reads/ directory we just created

cd /mnt/albasj/analysis/${username}
ls -1 basecalled_reads/

Important directory: workspace/

ls basecalled_reads/workspace/

This contains two sub-directories pass/ and fail/ which contain the basecalled read data in fastq format (ready for use with common aligners like BW).

pass/ obviously contains all of the reads which can be used in further downstream analysis and fail/ contains reads which are un-useable

ls basecalled_reads/workspace/pass/

To see the basic format of a .fastq file run the command below. fastq’s will be covered further in tomorrows session

less basecalled_reads/workspace/pass/fastq_runid_eeb92128ecdaf98ba8cd29e26976e99b3843f88e_0.fastq

(Hint: press q to exit less, a text reading program)

Important file: sequencing_summary.txt

less basecalled_reads/sequencing_summary.txt

(Hint: press q to exit less, a text reading program)

This file can be used to plot basic run statistics as part of QC

8) Basic run QC

It is much easier to view this data in R.

Open R:

R

and execute the following commands:

library(ggplot2)
summary_data <- read.table( "basecalled_reads/sequencing_summary.txt", sep="\t", header=TRUE)
ggplot(data=summary_data, aes(sequence_length_template)) + 
 geom_histogram()

You can also represent the length template by qscore template:

ggplot(data=summary_data, aes(x=mean_qscore_template, y=sequence_length_template, color=passes_filtering)) +
 geom_point()

Hint: to quit R, type: quit()

Final Comment

This was a VERY basic overview of nanopore data analysis. Below is a diagram showing the parts of an overall workflow this tutorial has covered.

img_6