This post is part 8 of a series on bioinformatics file formats, written for the 2017 UK-KBRIN Essentials of Next Generation Sequencing Workshop at the University of Kentucky.
Newbler is a software package for de novo assembly of 454 sequencing data. It is available for free from Roche as part of the GS de novo Assembler package. While Roche may have discontinued support for it, you may still find yourself working with output from this assembler.
Newbler outputs many files. For each one, I've excerpted the description from the manual and added a table of the column labels and their meanings.
Assembly project.xml
Tab-delimited file giving position-by-position consensus base and flow signal information.
We will not be going over this file, as it is intended to be machine-readable.
454AlignmentInfo.tsv
The 454AlignmentInfo.tsv file contains position-by-position summary information about the consensus sequence for the contigs generated by the GS De Novo Assembler application, listed one nucleotide per line (in a tab-delimited format). Output conditionally (using the -info/-infoall/-noinfo options or the selection made on the GUI Parameters Tab Output Sub-tab). By default, this file is only output if there are fewer than 4 million input reads and the total length of assembled contigs is less than 40 Mbp. For larger projects, -info or -infoall or the corresponding GUI Output selection for Alignment Info must be used to generate this file.
Column | Label | Description |
---|---|---|
1 | Position | Position in the contig |
2 | Consensus | Consensus nucleotide at that position |
3 | Quality Score | Quality score of consensus base |
4 | Unique Depth | Number of unique reads at that position |
5 | Align Depth | Number of reads at that position |
6 | Signal | Average signal of the read flowgrams for the flows that correspond to that position |
7 | StdDeviation | Standard deviation of the read flowgram signals at that position |
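As a quick illustration of working with these columns, here is a minimal Python sketch that averages the Align Depth column for each contig. It assumes the seven-column layout in the table above, and it also assumes that a line starting with ">" introduces each contig's block of rows; check that against your own file before relying on it. The function name and the sample rows are made up for this example.

```python
import csv
import io
from collections import defaultdict


def mean_depth_per_contig(handle):
    """Return {contig name: mean Align Depth} for an AlignmentInfo-style table."""
    totals = defaultdict(lambda: [0, 0])  # contig -> [sum of depths, number of positions]
    contig = "unknown"
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue
        if row[0].startswith(">"):
            # Assumption: a ">" line names the contig for the rows that follow.
            contig = row[0][1:]
            continue
        if not row[0].isdigit():
            continue  # skip the column header line
        totals[contig][0] += int(row[4])  # column 5: Align Depth
        totals[contig][1] += 1
    return {name: depth_sum / n for name, (depth_sum, n) in totals.items()}


# Made-up rows, for illustration only.
alignment_sample = (
    "Position\tConsensus\tQuality Score\tUnique Depth\tAlign Depth\tSignal\tStdDeviation\n"
    ">contig00001\t1\n"
    "1\tA\t64\t3\t4\t1.02\t0.05\n"
    "2\tC\t64\t3\t6\t0.98\t0.04\n"
)
print(mean_depth_per_contig(io.StringIO(alignment_sample)))
```

For a real project you would pass an open file handle instead of the in-memory sample.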
454readstatus.txt
Tab-delimited text file providing a per-read report of the status of each read in the assembly. The 3' and 5' positions of each read's alignment within the contig are also reported.
Column | Label | Description |
---|---|---|
1 | Accno | Accession number of the input read |
2 | Read status | Status of the read in the assembly. Can be 'assembled', 'partially assembled', 'singleton', 'repeat', 'outlier', or 'too short (discarded)' |
3 | 5' Contig | Accession number of the contig where the 5' end of the read aligns |
4 | 5' Position | The position in the 5' contig where the 5' end begins |
5 | 5' Strand | Orientation of the read relative to the 5' contig. |
6 | 3' Contig | Accession number of the contig where the 3' end of the read aligns |
7 | 3' Position | The position in the 3' contig where the 3' end aligns |
8 | 3' Strand | Orientation of the read relative to the 3' contig |
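A common first look at this file is a tally of how many reads ended up in each status. The sketch below is one way to do that in Python, assuming a header line whose first field is "Accno" and assuming that reads without an alignment may have fewer than eight columns. The read names are invented, and the exact spelling of the status values in a real file may differ from the labels used here.

```python
import csv
import io
from collections import Counter


def count_read_status(handle):
    """Tally the second column (read status) of a read-status style table."""
    counts = Counter()
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2 or row[0] == "Accno":
            continue  # skip blank lines and the assumed header line
        counts[row[1]] += 1
    return counts


# Made-up rows, for illustration only.
status_sample = (
    "Accno\tRead Status\t5' Contig\t5' Position\t5' Strand\t3' Contig\t3' Position\t3' Strand\n"
    "read1\tassembled\tcontig00001\t10\t+\tcontig00001\t250\t+\n"
    "read2\tsingleton\n"
    "read3\tassembled\tcontig00002\t300\t-\tcontig00002\t40\t-\n"
)
for status, n in count_read_status(io.StringIO(status_sample)).most_common():
    print(status, n)
```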
trimstatus.txt
Tab-delimited text file providing a per-read report of the original and revised trim points used in the assembly.
Column | Label | Description |
---|---|---|
1 | Accno | Accession number of the input read |
2 | Trimpoints used | Portion of the read used (e.g., 2-301) |
3 | Trimmed length | Final length |
4 | Orig. trimpoints | The original trimpoints specified in the input |
5 | Orig. trimmed length | Original trimmed length of the read |
6 | Raw length | Length of the read with no trimming |
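To make the relationship between the trimpoint and length columns concrete, here is a small Python sketch that turns a trimpoint string such as 2-301 into a length. It assumes the two numbers are 1-based, inclusive start and end positions, which is my reading of the table rather than something stated in the excerpt. Both function names are hypothetical.

```python
def parse_trimpoints(text):
    """Split a 'start-end' trimpoint string into two integers."""
    start, end = (int(part) for part in text.split("-"))
    return start, end


def trimmed_length(text):
    """Length of the kept portion, assuming 1-based inclusive coordinates."""
    start, end = parse_trimpoints(text)
    return end - start + 1


print(parse_trimpoints("2-301"))
print(trimmed_length("2-301"))
```

Under that assumption, the result should match the Trimmed length column for the same read, which makes a handy sanity check.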
pairstatus.txt
Tab-delimited text file providing a per-pair report of the location and status of each paired-end read pair and how it was used in the assembly.
Column | Label | Description |
---|---|---|
1 | Template | Template string for the pair (454 accession for 454 paired end reads) |
2 | Status | Same Contig: pairs are on same contig as expected. Link: pairs are on different contigs, but could perhaps link these contigs. All other statuses are failures. |
3 | Distance | Distance between linked reads |
4 | Left contig | Contig where the left half assembled |
5 | Left pos | 5' position on contig |
6 | Left dir | Forward (+) or reverse (-) |
7 | Right contig | Contig where the right half assembled |
8 | Right pos | 5' position on contig |
9 | Right dir | Forward (+) or reverse (-) |
10 | Left distance | Distance from Left pos to the end of the contig |
11 | Right distance | Distance from Right pos to the end of the contig |
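One use for this table is a rough estimate of the paired-end library's insert size from pairs that landed on the same contig. The Python sketch below collects the Distance column for one status and averages it. It assumes a header line beginning with "Template" and skips any row whose distance is not an integer. The status string "Same Contig" is taken from the table above and may be spelled differently in a real file, and the template names are invented.

```python
import csv
import io
from statistics import mean


def distances_for_status(handle, status):
    """Collect the Distance column (3rd) for rows whose Status (2nd) matches."""
    distances = []
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 3 or row[0] == "Template" or row[1] != status:
            continue
        try:
            distances.append(int(row[2]))
        except ValueError:
            continue  # distance missing or not numeric for this pair
    return distances


# Made-up rows, for illustration only.
pair_sample = (
    "Template\tStatus\tDistance\tLeft contig\tLeft pos\tLeft dir\t"
    "Right contig\tRight pos\tRight dir\tLeft distance\tRight distance\n"
    "pairA\tSame Contig\t2900\tcontig00001\t100\t+\tcontig00001\t3000\t-\t100\t500\n"
    "pairB\tSame Contig\t3100\tcontig00001\t200\t+\tcontig00001\t3300\t-\t200\t200\n"
    "pairC\tLink\t-\tcontig00001\t3400\t+\tcontig00002\t150\t-\t100\t150\n"
)
same_contig = distances_for_status(io.StringIO(pair_sample), "Same Contig")
print(len(same_contig), mean(same_contig))
```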
454PairAlign.txt
A text file giving the pairwise alignment(s) of the overlaps used in the assembly computation (only produced when using the -pair option [or -pairt option for the tab-delimited version of the file]).
Column | Label | Description |
---|---|---|
1 | Query Accno | Accession number of the read |
2 | Query start | Starting position of alignment in query |
3 | Query end | Ending position of alignment in query |
4 | Query length | Length of the query |
5 | Subj accno | Accession number of subject sequence |
6 | Subj start | Starting position of alignment in subject |
7 | Subj end | Ending position of alignment in subject |
8 | Subj length | Length of the subject |
9 | Num ident | Number of matches in alignment |
10 | Align length | Length of the total alignment |
11 | Query align | Sequence of the query in the alignment |
12 | Subj align | Sequence of the subject in the alignment |
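Columns 9 and 10 are enough to compute a percent identity for each overlap, and columns 11 and 12 let you double-check it. The Python sketch below does both for a single tab-delimited row. It assumes the twelve-column layout in the table above and assumes gaps in the aligned sequences are written as "-". The accession names and sequences are invented for illustration.

```python
def percent_identity(num_ident, align_length):
    """Identity as a percentage of the total alignment length."""
    return 100.0 * num_ident / align_length


def count_identities(query_align, subj_align, gap="-"):
    """Count aligned positions where both sequences carry the same base."""
    return sum(
        1
        for q, s in zip(query_align, subj_align)
        if q == s and q != gap  # assumption: gaps are written as "-"
    )


# One made-up row in the assumed 12-column layout.
pair_align_row = "readA\t1\t9\t250\treadB\t5\t14\t240\t9\t10\tACGT-ACGTA\tACGTTACGTA"
fields = pair_align_row.split("\t")
num_ident, align_length = int(fields[8]), int(fields[9])

print(percent_identity(num_ident, align_length))
print(count_identities(fields[10], fields[11]) == num_ident)
```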
454NewblerMetrics.txt
File providing various assembly metrics, including the number of input runs and reads, and the number and size of the large consensus contigs as well as all consensus contigs.
Please see the manual for a full description of what is included in this file.
contigs.ace
ACE format file that can be loaded by third-party viewer programs that support the ACE format. The output can be a single file for the entire project or a folder containing individual files for each contig in the assembly.
This file is not necessarily intended to be human readable. Please see the manual for a full description of how to interpret this file.
454ContigGraph.txt
The 454ContigGraph.txt file contains a graph-based description of the branching structure of an assembly's contigs, where nodes represent the contigs and edges between contigs give the branching structure. When paired end reads are included in the assembly, the file also shows scaffold edges, describing how the contigs have been linked together into scaffolds. The entries in the file are grouped by type, of which there are six, but not all projects will contain entries for every type. Two types of entries ("S" and "P") are only present when paired end reads are included in the assembly.
This file is not necessarily intended to be human readable. Please see the manual for a full description of how to interpret this file.