Alignment

Working directory
FASTQ
Reads QC
Alignment
Alignment QC

Working directory

Open your terminal, and go to your working directory,

cd /mnt/albasj/analysis/username

and create the following directory structure:

mkdir stats
mkdir alignment
mkdir variant_calling

FASTQ

The data we will be using is a subset of specific regions of chromosome X in a FASTQ format. FASTQ format is a text-based format for storing both a biological sequence and its corresponding quality scores.

A FASTQ file normally uses four lines per sequence: 1) Begins with a ‘@’ and is followed by a sequence identifier 2) Is the raw sequence letters 3) Begins with a ‘+’ character 4) Encodes the quality values for the sequence in Line 2

You can visualize the FASTQ file typing:

less -S /mnt/albasj/data/nanopore/fastq/child.nanopore.ROI.fastq

How many reads do we have?

awk '{s++}END{print s/4}' /mnt/albasj/data/nanopore/fastq/child.nanopore.ROI.fastq

Reads QC

First we will get the read length for each read:

awk '{if(NR%4==2) print length($1)}' /mnt/albasj/data/nanopore/fastq/child.nanopore.ROI.fastq > stats/read_length.txt

And look at the read length distribution. For that, you can start R from the command-line:

and then, type the following:

library(ggplot2)
readLength <- read.table("stats/read_length.txt", header=FALSE, col.names = "length")
ggplot(data=readLength, aes(length)) + geom_histogram()

Hint: for quitting R, just type, quit()

Alignment

The standard format for aligned sequence data is SAM (Sequence Alignment Map).

SAM files have a header that contains information on alignment and contigs used, and the aligned reads:

But because SAM files can be large, they are usually stored in the compressed version of them, BAM files.

Multiple algorithms have been developed to align long reads to a genome of reference. Some examples are:

Graphmap: http://github.com/isovic/graphmap
bwa mem -x l ont2d: http://github.com/lh3/bwa
LAST: http://last.cbrc.jp
NGMLR: http://github.com/philres/ngmlr
minimap2: http://github.com/lh3/minimap2

Here we will use NGMLR. First we will map the reads to the genome of reference (GRCh37)

ngmlr -r /mnt/albasj/data/reference/GRCh37/Homo_sapiens.GRCh37.75.dna.fasta \
 -q /mnt/albasj/data/nanopore/fastq/child.nanopore.ROI.fastq \
 -o alignment/child.nanopore.ROI.sam

and convert the SAM output to BAM format.

samtools view alignment/child.nanopore.ROI.sam \
 -O BAM \
 -o alignment/child.nanopore.ROI.bam

Then, we will sort it by mapping coordinate and save it as BAM.

samtools sort -o alignment/child.nanopore.ROI.sort.bam alignment/child.nanopore.ROI.bam

Finally we will index the BAM file to run samtools subtools later.

samtools index alignment/child.nanopore.ROI.sort.bam

To visualise the BAM file:

samtools view -h alignment/child.nanopore.ROI.sort.bam | less -S

Alignment QC

As a first QC, we can run samtools stats:

samtools stats alignment/child.nanopore.ROI.sort.bam > stats/stats.txt

and look at the first 40 lines:

head -n40 stats/stats.txt

How many reads were mapped?
Which was the average length of the reads? And the maximum read length?

Now we will get the coverage per base using samtools depth.

samtools depth alignment/child.nanopore.ROI.sort.bam > stats/coverage.txt

And look at the coverage distribution in R. For that, you can start R from the command-line:

and then, type the following:

library(ggplot2)
coverage <-  read.table("stats/coverage.txt", header=FALSE, col.names = c("chrom", "pos", "cov"))
cov_percent <- data.frame(  "cov" = seq(1,max(coverage$cov)) 
                          , "percent" = sapply(seq(1,max(coverage$cov)), function(x) nrow(coverage[coverage$cov >= x,])/nrow(coverage)))
p <- ggplot(cov_percent, aes(x = cov, y = percent)) + 
     geom_line() + 
     scale_x_continuous(breaks=seq(0,max(coverage$cov), 1)) + 
     xlab("Coverage") + 
     ylab("Percentage of bases")
p

You can also add a vertical line to the previous plot intercepting the median coverage:

p + geom_vline(xintercept=median(coverage$cov), colour = "red")

NGSchool 2018

Complex Structural Variants in Mendelian Disorders

Alignment

Working directory

FASTQ

Reads QC

Alignment

Alignment QC