Understanding DISCOVAR output
This post is part 10 of a series on bioinformatics file formats, written for the 2017 UK-KBRIN Essentials of Next Generation Sequencing Workshop at the University of Kentucky.
DISCOVAR is a new genome assembler and variant caller developed by the Broad Institute. As of this writing, it takes as input Illumina reads of length 250 bases or longer produced on a MiSeq or HiSeq 2500. To learn more, see the DISCOVAR manual.
The assembly output
The final output assembly will take the form of
OUT_HEAD is set by the user. Below, let’s assume we’ve set OUT_HEAD to out.
DISCOVAR also generates a number of intermediate assembly files, named
The final assembly is a graph whose edges are written to two FASTA files: the fasta0 file contains the non-overlapping edges, whereas out.final.fasta extends the edges so that they overlap by K-1 bases.
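To make the K-1 overlap concrete, here is a minimal sketch in Python. It assumes that two edges meeting at a node in the overlapping file share exactly K-1 bases, the end of one matching the start of the next. The function name shares_overlap and the toy sequences are made up for illustration and are not part of DISCOVAR; the sketch also ignores strand, so reverse-complemented edges are not handled.

```python
def shares_overlap(upstream, downstream, k):
    """Return True if the last K-1 bases of `upstream` equal the
    first K-1 bases of `downstream`.

    Hypothetical helper for illustration only; it is not a DISCOVAR API.
    """
    overlap = k - 1
    # An overlap needs at least one base and cannot exceed either sequence.
    if overlap <= 0 or len(upstream) < overlap or len(downstream) < overlap:
        return False
    return upstream[-overlap:] == downstream[:overlap]


# Toy sequences with K=4, so the expected overlap is 3 bases ("TGA").
edge_a = "ACGTTGA"
edge_b = "TGACCA"
print(shares_overlap(edge_a, edge_b, k=4))
```

With a real assembly you would pass the K value reported in the FASTA headers rather than the toy value used here.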
Which of the two files you use will depend on your specific needs. In either case, each FASTA header takes the form:
edge-name start-node:end-node k-size edge-size
>edge_10 1:100 K=80 bases=330
Note that headers in the fasta0 file have no K parameter, so that field is omitted.
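The header layout described above can be parsed with a short regular expression. The following is a sketch rather than an official parser: it assumes headers look exactly like the example shown, treats the K field as optional to cover the fasta0 case, and uses a function name and dictionary keys (parse_edge_header, start_node, and so on) chosen here for illustration.

```python
import re

# Pattern for: >edge-name start-node:end-node [K=k-size] bases=edge-size
# The K field is optional because fasta0 headers omit it.
HEADER_RE = re.compile(
    r"^>(?P<name>\S+)\s+(?P<start>\d+):(?P<end>\d+)"
    r"(?:\s+K=(?P<k>\d+))?\s+bases=(?P<bases>\d+)\s*$"
)


def parse_edge_header(line):
    """Parse one edge header line into a dict (hypothetical helper)."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized header: {line!r}")
    k = match.group("k")
    return {
        "name": match.group("name"),
        "start_node": int(match.group("start")),
        "end_node": int(match.group("end")),
        "k": int(k) if k is not None else None,  # None when K is omitted
        "bases": int(match.group("bases")),
    }


print(parse_edge_header(">edge_10 1:100 K=80 bases=330"))
```

Raising an error on unrecognized lines, instead of skipping them silently, makes it obvious if a real file deviates from the format assumed here.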
Visualizing the output graph
The assembly graph can be visualized with Graphviz using the commands below:
dot -Tps -o assembly.final.ps assembly.final.dot
gv assembly.final.ps
The resulting visualization shows each edge labeled with its edge ID and color-coded by length.
Example visualization of a DISCOVAR edge, from the DISCOVAR manual.