This post is part 10 of a series on bioinformatics file formats, written for the 2017 UK-KBRIN Essentials of Next Generation Sequencing Workshop at the University of Kentucky.
DISCOVAR is a new genome assembler and variant caller developed by the Broad Institute. As of this writing, it takes as input Illumina reads of length 250 or longer produced on MiSeq or HiSeq 2500. To learn more, read the entire DISCOVAR manual here.
The final output assembly will take the form of OUT_HEAD.final.*, where OUT_HEAD is set by the user. Below, let's assume we've set OUT_HEAD to out.
DISCOVAR also generates a number of intermediate assembly files, named out.n.*.
The final assembly is a graph, the edges of which are contained in two FASTA files: out.final.fasta0 and out.final.fasta. out.final.fasta0 contains the non-overlapping edges, whereas out.final.fasta extends the edges to overlap by K-1 bases.
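To make the K-1 overlap concrete, here is a minimal Python sketch. It assumes two edges that are adjacent in the graph (the end node of the first is the start node of the second) and that the overlap is an exact suffix/prefix match, as is usual for de Bruijn-style graphs. The function name and the toy sequences are hypothetical, and K=5 is used only to keep the example short.

```python
def share_k_minus_1_overlap(upstream: str, downstream: str, k: int) -> bool:
    """Return True if the last K-1 bases of `upstream` equal the first K-1 bases of `downstream`."""
    overlap = k - 1
    # An edge shorter than the overlap cannot satisfy the condition.
    if overlap <= 0 or len(upstream) < overlap or len(downstream) < overlap:
        return False
    return upstream[-overlap:] == downstream[:overlap]


# Toy sequences for illustration only, not real DISCOVAR output.
edge_a = "ACGTTGCA"
edge_b = "TGCAGGTT"
print(share_k_minus_1_overlap(edge_a, edge_b, k=5))
```

A check like this only makes sense on out.final.fasta; the edges in out.final.fasta0 are described as non-overlapping.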
Which file you use will depend on your specific needs. In either case, each FASTA header takes the form:
edge-name start-node:end-node k-size edge-size
For example:
>edge_10 1:100 K=80 bases=330
Note that the fasta0 file has no K parameter, so the K field is omitted from its headers.
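The header format above can be parsed with a short Python sketch like the one below. The regular expression is inferred from the single example header, and it assumes that fasta0 headers look the same apart from the missing K field. The names EdgeHeader and parse_edge_header are hypothetical, not part of DISCOVAR.

```python
import re
from typing import NamedTuple, Optional


class EdgeHeader(NamedTuple):
    name: str
    start_node: int
    end_node: int
    k: Optional[int]  # None when the K field is absent (assumed fasta0 form)
    bases: int


# Pattern inferred from the example header; the K field is optional.
HEADER_RE = re.compile(
    r"^>(?P<name>\S+)\s+(?P<start>\d+):(?P<end>\d+)"
    r"(?:\s+K=(?P<k>\d+))?\s+bases=(?P<bases>\d+)$"
)


def parse_edge_header(line: str) -> EdgeHeader:
    """Parse one header line such as '>edge_10 1:100 K=80 bases=330'."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"Unrecognized edge header: {line!r}")
    k = match.group("k")
    return EdgeHeader(
        name=match.group("name"),
        start_node=int(match.group("start")),
        end_node=int(match.group("end")),
        k=int(k) if k is not None else None,
        bases=int(match.group("bases")),
    )


header = parse_edge_header(">edge_10 1:100 K=80 bases=330")
print(header.name, header.start_node, header.end_node, header.k, header.bases)
```

Collecting the start and end nodes of every edge this way is enough to rebuild the graph's connectivity without opening the DOT file.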
The assembly graph can be visualized with Graphviz using the commands below. The file names shown correspond to OUT_HEAD set to assembly; substitute your own prefix:
dot -Tps -o assembly.final.ps assembly.final.dot
gv assembly.final.ps
The resulting visualization will show each edge labeled with its edge ID and color-coded by length.
Example visualization of a DISCOVAR edge, from the DISCOVAR manual.