Velvet assembly output files
This post is part 4 of a series on file formats, written for the 2017 UK-KBRIN Essentials of Next Generation Sequencing Workshop at the University of Kentucky.
The velvet manual is hosted online in wiki format. You can find the section on output files here. I’ll be including relevant quotes from the manual. I also found this site helpful, and some of the examples are taken from here. Ultimately the stats files included with the assembly can be more confusing than helpful. Don’t be too concerned if you don’t understand all of the metrics.
Below is a brief guide to the files included in the output.
From the manual, on contigs.fa:
This fasta file contains the sequences of the contigs longer than 2k, where k is the word-length used in velveth. If you have specified a min_contig_lgth threshold, then the contigs shorter than that value are omitted.
The stats.txt file is a tab-delimited table describing the nodes in the assembly, one row per node.
The table below describes the column headers.
| Column header | Full name | Description |
|---|---|---|
| lgth | length | length (in k-mers) |
| out | 3’ arcs | number of arcs at the 3’ end |
| in | 5’ arcs | number of arcs at the 5’ end |
| long_cov | coverage long | coverage in long reads |
| short1_cov | coverage short 1 | coverage in short1 reads (including divergent sequences) |
| short1_Ocov | coverage short 1 | coverage in short1 reads (conform to consensus only, strict) |
| short2_cov | coverage short 2 | coverage in short2 reads (including divergent sequences) |
| short2_Ocov | coverage short 2 | coverage in short2 reads (conform to consensus only, strict) |
| long_nb | long reads in node | number of long reads in node |
| short1_nb | short 1 reads in node | number of short1 reads in node |
| short2_nb | short 2 reads in node | number of short2 reads in node |
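Because node lengths are reported in k-mers, it can help to convert them to nucleotides: a node made of n consecutive k-mers spans n + k - 1 bases. The snippet below is a minimal sketch of that conversion, not part of Velvet itself. It assumes the stats file is tab-delimited with a leading `ID` column, and the function name `node_lengths_in_nt` and the sample rows are made up for illustration.

```python
import csv
import io


def node_lengths_in_nt(stats_file, k):
    """Return (node ID, length in nucleotides) for each row of a stats.txt-style table."""
    reader = csv.DictReader(stats_file, delimiter="\t")
    result = []
    for row in reader:
        # lgth counts k-mers; n consecutive k-mers cover n + k - 1 nucleotides.
        result.append((row["ID"], int(row["lgth"]) + k - 1))
    return result


# Made-up rows for illustration only; a real stats.txt has more columns.
example = io.StringIO(
    "ID\tlgth\tout\tin\tlong_cov\tshort1_cov\n"
    "1\t100\t1\t1\t0.0\t12.5\n"
    "2\t31\t0\t2\t0.0\t8.0\n"
)

lengths = node_lengths_in_nt(example, k=31)
print(lengths)
```

With a real assembly you would pass an open file handle for stats.txt and the same k that was given to velveth.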
The velvet_asm.afg file is meant to be read by the AMOS assembly package, so we will not be concerned with it.
The LastGraph file describes the graph that Velvet builds to create the assembly.
There is one header line for the entire graph, which lists the number of nodes, the number of sequences, and the hash length (the k used in velveth).
Then, each node has a block with the following format:
NODE $NODE_ID $COV_SHORT1 $O_COV_SHORT1 $COV_SHORT2 $O_COV_SHORT2
$ENDS_OF_KMERS_OF_NODE
$ENDS_OF_KMERS_OF_TWIN_NODE
The coverage fields should look similar to what's in the stats.txt file. The ends-of-k-mers values are the last nucleotides of the k-mers in that node.
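To make the layout concrete, here is a rough sketch of reading the header line and the node blocks. It assumes whitespace-separated fields and that each NODE line is followed by two lines holding the ends-of-k-mers strings for the node and for its twin. The function name `parse_lastgraph_nodes` and the tiny sample graph are hypothetical, and any other record types in the file are simply skipped.

```python
def parse_lastgraph_nodes(lines):
    """Parse the header line and the NODE blocks of a LastGraph-style file."""
    it = iter(lines)
    header = next(it).split()
    # First three header fields: number of nodes, number of sequences, hash length.
    n_nodes, n_sequences, hash_length = (int(x) for x in header[:3])

    nodes = []
    for line in it:
        if not line.startswith("NODE"):
            continue  # this sketch ignores every other record type
        fields = line.split()
        nodes.append({
            "id": int(fields[1]),
            "coverage_fields": fields[2:],   # kept as raw strings
            "ends": next(it).strip(),        # ends of k-mers of the node
            "twin_ends": next(it).strip(),   # ends of k-mers of the twin node
        })

    return {
        "n_nodes": n_nodes,
        "n_sequences": n_sequences,
        "hash_length": hash_length,
        "nodes": nodes,
    }


# Made-up graph for illustration only.
sample = """2\t10\t31
NODE\t1\t40\t40\t0\t0
ACGT
TTGA
NODE\t2\t12\t12\t0\t0
GG
CA
"""

graph = parse_lastgraph_nodes(sample.splitlines())
print(graph["hash_length"], [node["id"] for node in graph["nodes"]])
```

For a real file you would pass an open file handle instead of `sample.splitlines()`. Keeping the coverage fields as strings avoids guessing at their numeric types.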