Guide to Newbler output

This post is part 8 of a series on bioinformatics file formats, written for the 2017 UK-KBRIN Essentials of Next Generation Sequencing Workshop at the University of Kentucky.

Introduction

Newbler is a software package for de novo assembly of 454 sequencing data. It is available for free from Roche as part of the GS de novo Assembler package. While Roche may have discontinued support for it, you may still find yourself working with output from this assembler.

It outputs many files. For each one, I've excerpted the description from the manual and added a table of the column labels and their meanings.

Assembly project.xml

Tab-delimited file giving position-by-position consensus base and flow signal information.

We will not be going over this file, as it's intended to be machine-readable.

454AlignmentInfo.tsv

The 454AlignmentInfo.tsv file contains position-by-position summary information about the consensus sequence for the contigs generated by the GS De Novo Assembler application, listed one nucleotide per line (in a tab-delimited format). It is output conditionally (using the -info/-infoall/-noinfo options or the selection made on the GUI Parameters Tab Output Sub-tab). By default, this file is only output if there are fewer than 4 million input reads and the total length of assembled contigs is less than 40 Mbp. For larger projects, -info or -infoall or the corresponding GUI Output selection for Alignment Info must be used to generate this file.

| Column | Label | Description |
|---|---|---|
| 1 | Position | Position in the contig |
| 2 | Consensus | Consensus nucleotide at that position |
| 3 | Quality Score | Quality score of consensus base |
| 4 | Unique Depth | Number of unique reads at that position |
| 5 | Align Depth | Number of reads at that position |
| 6 | Signal | Average signal of the read flowgrams for the flows that correspond to that position |
| 7 | StdDeviation | Standard deviation of the read flowgram signals at that position |
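
As a rough illustration of working with these columns, the Python sketch below computes the mean Align Depth for each contig. It assumes a layout that the description above does not spell out: a single header row, then a line beginning with `>` that names each contig, followed by that contig's rows. The function name and the sample data are made up for the example, so check them against your own file before relying on it.

```python
import csv
import io
from collections import defaultdict


def mean_depth_per_contig(handle):
    """Return {contig_name: mean Align Depth} for an AlignmentInfo-style file.

    Assumed layout (not confirmed by the manual excerpt): one header row,
    then a line starting with '>' naming each contig, followed by that
    contig's seven-column rows.
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    contig = None
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0] == "Position":  # skip blank lines and the header
            continue
        if row[0].startswith(">"):           # start of a new contig
            contig = row[0][1:]
            continue
        totals[contig] += int(row[4])        # column 5: Align Depth
        counts[contig] += 1
    return {name: totals[name] / counts[name] for name in totals}


# Made-up sample data, for illustration only.
sample = (
    "Position\tConsensus\tQuality Score\tUnique Depth\tAlign Depth\tSignal\tStdDeviation\n"
    ">contig00001\t1\n"
    "1\tA\t64\t10\t12\t1.02\t0.08\n"
    "2\tC\t64\t12\t16\t0.98\t0.05\n"
)
print(mean_depth_per_contig(io.StringIO(sample)))
```

For a real file, pass an open file handle instead of the `io.StringIO` wrapper.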

454ReadStatus.txt

Tab-delimited text file providing a per-read report of the status of each read in the assembly. The 3’ and 5’ positions of each read’s alignment within the contig are also reported.

| Column | Label | Description |
|---|---|---|
| 1 | Accno | Accession number of the input read |
| 2 | Read status | Status of the read in the assembly. Can be 'assembled', 'partially assembled', 'singleton', 'repeat', 'outlier', or 'too short (discarded)' |
| 3 | 5' Contig | Accession number of the contig where the 5' end of the read aligns |
| 4 | 5' Position | Position in the 5' contig where the 5' end of the read aligns |
| 5 | 5' Strand | Orientation of the read relative to the 5' contig |
| 6 | 3' Contig | Accession number of the contig where the 3' end of the read aligns |
| 7 | 3' Position | Position in the 3' contig where the 3' end of the read aligns |
| 8 | 3' Strand | Orientation of the read relative to the 3' contig |
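
A common first look at the read status report is a tally of how many reads ended up in each status. The sketch below does that in Python. It assumes the file starts with one header row and that the status is the second tab-separated field. The function name and the sample rows are invented for illustration, and the exact spelling of the status values in your file may differ from the ones used here.

```python
import csv
import io
from collections import Counter


def count_read_statuses(handle):
    """Count reads by the value in column 2 (Read status)."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # assumption: the first line is a header row
    return Counter(row[1] for row in reader if len(row) > 1)


# Made-up sample data, for illustration only.
sample = (
    "Accno\tRead Status\t5' Contig\t5' Position\t5' Strand\t"
    "3' Contig\t3' Position\t3' Strand\n"
    "read1\tassembled\tcontig00001\t1\t+\tcontig00001\t250\t+\n"
    "read2\tsingleton\n"
    "read3\tassembled\tcontig00002\t40\t-\tcontig00002\t1\t-\n"
)
for status, n in count_read_statuses(io.StringIO(sample)).most_common():
    print(status, n)
```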

454TrimStatus.txt

Tab-delimited text file providing a per-read report of the original and revised trim points used in the assembly.

| Column | Label | Description |
|---|---|---|
| 1 | Accno | Accession number of the input read |
| 2 | Trimpoints used | Portion of the read used (e.g., 2-301) |
| 3 | Trimmed length | Final length of the read after trimming |
| 4 | Orig. trimpoints | The original trimpoints specified in the input |
| 5 | Orig. trimmed length | Original trimmed length of the read |
| 6 | Raw length | Length of the read with no trimming |
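
To make the relationship between trimpoints and lengths concrete, here is a small Python sketch. It assumes trimpoints are 1-based and inclusive, so that 2-301 corresponds to 300 bases. That reading is an assumption on my part, as are the function name and the sample row, so confirm it against a few rows of your own file.

```python
def trimpoints_to_length(trimpoints):
    """Length of a range such as '2-301', assuming 1-based inclusive ends."""
    start, end = (int(part) for part in trimpoints.split("-"))
    return end - start + 1


# Made-up sample row in the six-column order described above.
row = "read1\t2-301\t300\t2-310\t309\t312".split("\t")
accno, used, trimmed_len, orig, orig_len, raw_len = row

# Bases removed from the raw read by the trimming actually used.
bases_removed = int(raw_len) - trimpoints_to_length(used)
print(accno, trimpoints_to_length(used), bases_removed)
```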

454PairStatus.txt

Tab-delimited text file providing a per-pair report of the location and status of each paired-end read pair used in the assembly.

| Column | Label | Description |
|---|---|---|
| 1 | Template | Template string for the pair (454 accession for 454 paired end reads) |
| 2 | Status | Same Contig: pairs are on same contig as expected. Link: pairs are on different contigs, but could perhaps link these contigs. All other statuses are failures. |
| 3 | Distance | Distance between linked reads |
| 4 | Left contig | Contig where the left half assembled |
| 5 | Left pos | 5' position on contig |
| 6 | Left dir | Direction: + for forward, - for reverse |
| 7 | Right contig | Contig where the right half assembled |
| 8 | Right pos | 5' position on contig |
| 9 | Right dir | Direction: + for forward, - for reverse |
| 10 | Left distance | Distance from Left pos to the end of the contig |
| 11 | Right distance | Distance from Right pos to the end of the contig |
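
Pairs with the Link status are the interesting ones when you want to know which contigs might sit next to each other. The Python sketch below counts linking pairs per contig pair. It assumes one header row and the column order given above. The function name, the `link_status` parameter, and the sample rows are hypothetical, and the status strings in your file may be spelled differently, which is why the value is a parameter.

```python
import csv
import io
from collections import Counter


def linked_contig_pairs(handle, link_status="Link"):
    """Count read pairs with the given status for each pair of contigs."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # assumption: the first line is a header row
    links = Counter()
    for row in reader:
        if len(row) >= 7 and row[1] == link_status:
            # Columns 4 and 7 hold the left and right contig names;
            # sort them so A-B and B-A count as the same pair.
            links[tuple(sorted((row[3], row[6])))] += 1
    return links


# Made-up sample data, for illustration only.
sample = (
    "Template\tStatus\tDistance\tLeft Contig\tLeft Pos\tLeft Dir\t"
    "Right Contig\tRight Pos\tRight Dir\tLeft Distance\tRight Distance\n"
    "t1\tSame Contig\t3000\tcontig00001\t100\t+\tcontig00001\t3100\t-\t\t\n"
    "t2\tLink\t-\tcontig00001\t9000\t+\tcontig00002\t200\t-\t500\t200\n"
    "t3\tLink\t-\tcontig00002\t150\t-\tcontig00001\t9100\t+\t150\t400\n"
)
for pair, n in linked_contig_pairs(io.StringIO(sample)).items():
    print(pair, n)
```

A contig pair supported by many linking read pairs is a stronger candidate for joining than one supported by a single pair.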

454PairAlign.txt

A text file giving the pairwise alignment(s) of the overlaps used in the assembly computation (only produced when using the -pair option, or the -pairt option for the tab-delimited version of the file).

| Column | Label | Description |
|---|---|---|
| 1 | Query accno | Accession number of the read |
| 2 | Query start | Starting position of alignment in query |
| 3 | Query end | Ending position of alignment in query |
| 4 | Query length | Length of the query |
| 5 | Subj accno | Accession number of subject sequence |
| 6 | Subj start | Starting position of alignment in subject |
| 7 | Subj end | Ending position of alignment in subject |
| 8 | Subj length | Length of the subject |
| 9 | Num ident | Number of matches in alignment |
| 10 | Align length | Length of the total alignment |
| 11 | Query align | Sequence of the query in the alignment |
| 12 | Subj align | Sequence of the subject in the alignment |
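
Columns 9 and 10 are enough to work out the percent identity of each overlap: Num ident divided by Align length. The Python sketch below does this for one line of the tab-delimited version. It assumes the twelve columns appear in the order listed above. The function name and the sample line are made up, and any header row in a real file would need to be skipped before calling it.

```python
def overlap_identity(line):
    """Return (query, subject, percent identity) for one tab-delimited line."""
    fields = line.rstrip("\n").split("\t")
    query, subject = fields[0], fields[4]
    num_ident = int(fields[8])      # column 9: Num ident
    align_length = int(fields[9])   # column 10: Align length
    return query, subject, 100.0 * num_ident / align_length


# Made-up sample line, for illustration only (sequences shortened).
sample_line = "read1\t1\t100\t250\tread2\t151\t250\t250\t98\t100\tACGT\tACGT\n"
print(overlap_identity(sample_line))
```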

454NewblerMetrics.txt

File providing various assembly metrics, including the number of input runs and reads, and the number and size of both the large consensus contigs and all consensus contigs.

Please see the manual for a full description of what's included in this file.

454Contigs.ace

ACE format file that can be loaded by third-party viewer programs that support the ACE format. The output can be a single file for the entire project or a folder containing individual files for each contig in the assembly.

This file is not necessarily intended to be human readable. Please see the manual for a full description of how to interpret this file.

454ContigGraph.txt

The 454ContigGraph.txt file contains a graph-based description of the branching structure of an assembly’s contigs, where nodes represent the contigs and edges between contigs give the branching structure. When paired end reads are included in the assembly, the file also shows scaffold edges, describing how the contigs have been linked together into scaffolds. The entries in the file are grouped by type. There are six types, but not all projects will contain entries for every type. Two types of entries (“S” and “P”) are only present when paired end reads are included in the assembly.

This file is not necessarily intended to be human readable. Please see the manual for a full description of how to interpret this file.