OBITools: processing DNA metabarcoding data

Introduction

The OBITools software is a set of tools specifically designed for processing Next Generation Sequencing data in a DNA metabarcoding context, taking into account taxonomic information. It is distributed as an open source software available on the following website: http://metabarcoding.org/obitools.

Data description: a bit of biology

Deoxyribonucleic acid (DNA) is a molecule composed of two polynucleotide chains that coil around each other to form a double helix carrying genetic information for development of all living organisms.

By convention, we write and read a DNA sequence from 5' to 3'. To describe a whole DNA molecule, we write the sequence of the positive sense strand (the negative antisense strand sequence can be deduced from the positive sense sequence as it is the reverse complementary).

Metabarcoding context

Determining the order of the nucleotides composing a DNA molecule is known as DNA sequencing. In metabarcoding context we want to target barcode sequence exclusively. Barcode are DNA sequence of gene from mitochondrion with the following property:

conservative region         hypervariable region            conservative region

===================|||||||||||||||||||||||||||||||||||||||||===================

The conservative region is typical of the taxonomic group we want to target (e.g. teleostei). The conservative region will be used as primer to initiate the PCR. The hypervariable region is the barcode sequence itself, it permits to identify at species level. In metabarcoding, barcode sequences from a sample are targeted and ampliflied by PCR using primer. Then the isolated DNA barcodes are pooled with barcode sequences from other samples. A tag sequence of 8 nucleotides is added to each DNA fragments from a given sample. So that it will be possible to assign the sample of a given DNA fragment in downstream analysis.

Next generation sequencing

DNA sequencing is performed on the library of amplified DNA barcodes. The sequencing of illumina instrument is paired-end. Adaptors are annealed to the ends of DNA fragments. Fragments bind to primer-loaded flow cell and bridge PCR reactions amplify each bound fragment to produce clusters of fragments. During each sequencing cycle, one fluorophore attached nucleotide is added to the growing strands. Laser excites the fluorophores in all the fragments that are being sequenced and an optic scanner collects the signals from each fragment cluster. Then the sequencing terminator is removed and the next sequencing cycle starts.

The NGS technique produces millions of short sequences (typical read length of 125 bp)

FASTA files

For FAST Alignment. Universal format to describe DNA sequences.

>ID1 ; a fish
ATCGAAAAAGGGTAAATAAAAG
ATCGAAAAAGGGTAAATAAAAG
ATCGAAAAAGGGTAAATAAAAG
GGTAAA​
  • The row starting with a > indicates information related to the DNA sequence. ID of the sequence ID1. Relative information a fish
  • The sequence is splitted into row of 80 characters and a jumpline marks the end of the sequence. A FASTA file can stores many sequences. These sequences are generally aligned.

FASTQ files

FASTQ stands for FASTA Quality. They are raw data produced by sequencing instrument. Each sequence is a read. For each DNA sequence, a quality of sequencing sequence is associated.

@HISEQ:250:CADV5ANXX:5:1101:1832:1959 1:N:0:ATCACGATTCTTTCCC
TCAACTACGCCTTCCGGTACACTTACCATGTTACGACTTGCCTCCCCTCGTCAGCGCTT
+
#?ABBFBFFFAFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFB
  • First row: Information. Name of instrument (HISEQ). Run ID (250). Flowcell ID (CADV5ANXX). Flowcell lane (5). Lane's tile (1101). Coordinates x,y (1832:1959). Wether the read is a backward(2) or forward(1) in a paired-end read or a single-end read(0). N/Y If Y the read must be removed according to quality check of sequencing instrument. No supplementary check (0) HISEQ. Barcode/index/tag of read.
  • Second row: DNA sequence
  • Third row: delimiter
  • Last row: quality scores are encoded into a compact form, which uses only 1 byte per quality value. In this encoding, the quality score is represented as the character with an ASCII code equal to its value + 33. For instance F corresponds to value 70-33=37. Phred-scaled quality scores are used to represent how confident we are in the assignment of each base call by the sequencer. The Phred quality score Q is logarithmically related to the error probability E.

Q = -10 \log(E)

Here is a table of how to interpret a range of Phred Quality Scores.

Phred Quality ScoreErrorAccuracy (1 - Error)
1010%90%
201%99%
300.1%99.9%
400.01%99.99%

Metabarcoding data processing workflow

Here, we describe the bioinformatics workflow used by SPYGEN to generate species environmental presence from raw eDNA data. This workflow is based on OBItools.

Sequencing data

Metabarcoding data to process are FASTQ files.

Structure of a DNA fragment is as follow:

5' [TAG] [PRIMER 5'] [BARCODE] [PRIMER 3'] [TAG] 3'

We sequenced each DNA fragment as paired-end meaning from both extremities:

Our aims will be to reconstitute the original DNA fragment sequence from forward and reverse reads and then to isolate, filter and cluster barcode sequences. Then a taxon will be assigned to each barcode allowing detection of species in samples.


1. Assemble reads

illuminapairedend aligns and merges the forward and the reverse reads of a couple of paired-end sequences. When the two reads overlap with a high alignment quality score, the consensus sequence is produced.

2. Demultiplexing

Multiplex sequencing is a technology to sequence multiple samples at the same time on a high-throughput DNA sequencer. Samples are distinguished by the short prefix of a DNA sequence called DNA tag. The demultiplex step consists to assign to each DNA sequence its original sample.

ngsfilter seeks for tag and primer sequences in both 5' and 3' direction of each consensus sequence. The primers and tags sequences are removed and annotations about the corresponding sample is added as new attributes.

3. Filtering and clustering

Identical barcode sequences are clustered using obiuniq. Then barcode sequences with a low depth coverage (meaning PCR amplification didn't work on the corresponding DNA fragment during sequencing so probably an error of sequencing) are removed using obigrep. Also, barcode sequences with a mean low quality phred-score are removed.

obiclean annotates the barcode sequences with the tags head, internal or singleton. The tag head corresponds to sequences that possess variant sequences in the dataset without being the variant of another record. Their variants are tagged as internal. The remaining sequences are annotated with the tag singleton. A variant must be related to the head sequence record with a number of differences in the alignment under a threshold value and is defined by a ratio between its count and the count of the head sequence which must be lower than a value fixed between 0 and 1. The command obigrep is then used to get rid of internal sequences.

4. Reference database

The reference database is a set of barcode sequences representative of taxon group such as species. Each taxon is located on the NCBI tree of life.

Here the strategy to build a reference database for metabarcoding:

5. Taxonomic assignment

ecotag associate the barcode sequences with taxonomic information from the reference database. It compares the record to the sequences of the provided reference database and assign the most similar one when the similarity reaches a fixed threshold. The command annotates the sequence records with the following attributes: the sequence of the reference database that matches best, the score of identity, a unique integer taxid often referring to one taxon in the NCBI taxonomy and the scientific name linked to the taxid integer. It allows to identify the most recent common ancestor of each sequence.

6. Format table

Some unuseful attributes are removed. The barcode sequences are sorted by decreasing order of count. Finally, a tab-delimited file that can be open by excel or R is generated. The table is a matrix with row as barcode sequences cluster and colon as sample abundance. Taxonomic assignment for each barcode sequence is mentionned.

Conclusion

Thanks to OBItools we were able to process eDNA raw sequencing data in order to detect species among samples. First, DNA fragments are merged, then demultiplexed (each sequence is assigned to its original sample). Third, sequence barcodes are extracted, filtered and clustered by similarity. Then a taxon is assigned to each barcode sequences. Ultimately, a table of environmental presence of each species abundance among samples is produced allowing further analysis.

References