A global ocean atlas of eukaryotic genes

Description of a metatranscriptomics approach to capture expressed genes in open ocean Tara Oceans stations.

Single-celled microeukaryotes and small multicellular zooplankton account for most of the planktonic biomass in the world’s ocean. Seawater samples were collected from the global ocean during the Tara Oceans expedition to generate a global ocean reference catalog of genes from planktonic eukaryotes sampled RNA and to explore the expression patterns of community at different microscopic scales with respect to biogeography and environmental conditions.

RNA sequencing

The key step in this technique is to convert RNA fragments into complementary DNA (cDNA library) using a reverse transcriptase. Adapters are then added to each end of the fragments. These adapters contain functional elements which permit sequencing. For example, the amplification element and the primary sequencing site. The cDNA library is then sequenced, producing short sequences called reads which correspond to either one or both ends of the fragment. The depth to which the library is sequenced varies depending on techniques which the output data will be used for.

1. Nucleic acid extraction

Cryogenic grinding is a very effective technique for taking hard substances, like plant and animal tissues, and turning them into dust. The DNA and RNA can be then extracted simulatenously. Then RNA is purified using a phenol-chloroform extraction method. DNA and RNA are quantified by a fluorometric method.

2. Library construction

Deoxyribonuclease (DNase) treatment is applied on RNA extractions. To target coding RNA, 3' polyadenylated tail RNA selection is used to limit rRNA quantity. Then RNA is reverse transcribed to cDNA because DNA is more stable and to allow for amplification (which uses DNA polymerases). DNA is fragmented by sonification to obtain fragments of a predifined length of 101 bp. The cDNA for each experiment are indexed with a barcode, so that these experiments can be pooled into a single lane for multiplexed sequencing.

3. Sequencing

Once the library is prepared, and adapters added, cDNA library is sequenced on HiSeq instrument with a read length of 101 bp in a paired-end mode.

Key words

Complementary DNA (cDNA): a DNA synthesized from a RNA molecule (e.g., messenger RNA (mRNA)) in a reaction catalyzed by the enzyme reverse transcriptase.

Library: a pool of DNA fragments with adapters attached. Adapters are designed to interact with the sequencing platform by the surface of the flow-cell (Illumina). A sequencing library can be made by starting from raw genomic DNA or from RNA. In any RNA sequencing library there’s an additional step: the RNA conversion in cDNA. The fragmentation step can be done before or after the cDNA synthesis.

Reads: a literal sequence generated by the sequencing instrument. The read is ultimately the sequence of a section of a unique fragment from the library. If an RNA is expressed in high copies then there will be more reads coming from it so that reads can be redundant as well.

Multiplexing: to pool and sequence libraries simultaneously during a single run on sequencing instruments. With multiplex sequencing, individual barcode sequences are added to each DNA fragment during next-generation sequencing (NGS) library preparation so that each read can be identified and sorted before the final data analysis.

Assembly and gene catalog construction

Each read sequences are assembled into a contig using VELVET. VELVELT is an assembler which takes advantage of De Bruijn graph approach to focus on observed words (or k-mers) among the read sequences. A given k-mer is therefore represented by a unique node, regardless of how many times it was observed. The costliest part of constructing a de Bruijn graph consists in hashing all the reads, according to a given word length (selected k-mer size is 63 in this study). This operation offers nonetheless an advantageous time complexity compared to a general pairwise alignment of all the sequences, especially given the high coverage depths encountered. Once all the reads have been hashed, each of their paths are traced along the k-mer nodes, incrementing coverage and creating the appropriate arcs along the way.

For each cluster of contig, the longest contig is kept as the representative sequence of the cluster. Sequences were BLASTed against a reference database of known sequences from Prokatyote, mitochondrial and chloroplastic specimens. Sequence with at least 70% identity with the reference database are removed. So that we obtain a catalog of unigenes. In summary a unigene is a complete or partial transcript assembled from metatranscriptomic reads of at least one Tara Oceans station.

Taxonomic assignment

To assign a taxonomic group to each unigene, a reference database is built from UniRef. Sequence similarities between metatranscriptomic gene catalog and reference database is handled by DIAMOND. DIAMOND is a sequence aligner for protein and translated DNA searches, designed for high performance analysis of big sequence data. It allows pairwise alignment of proteins and translated DNA at 100x-20,000x speed of BLAST. Taxonomic affiliation uses a weighted lowest common ancestor approach. For each unigene only taxon reference sequences with at least 90% similarity were kept.

Functionnal characterization of unigenes

Protein domain prediction is performed using HMMER. HMMER is used for searching sequence databases for sequence homologs, and for making sequence alignments. Unigenes were seeked in the profile database Pfam. So that a protein domain is assigned to each unigene. Unigenes without assigned protein domain are not kept.

Protein and profile database

UniProt: the database of protein sequence and functional information, many entries being derived from genome sequencing projects.

UniGene: partition of sequences into a non-redundant set of gene-oriented clusters. Each UniGene cluster contains sequences that represent a unique gene, as well as related information such as the tissue types in which the gene has been expressed and map location.

UniRef: UniProt Reference Clusters provide clustered sets of sequences from UniProt in order to obtain complete coverage of the sequence space at several resolutions.

Pfam: the database of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs). Proteins are generally composed of one or more functional regions, commonly termed domains. Different combinations of domains give rise to the diverse range of proteins found in nature. Pfam entry data are based on UniProt reference proteomes.

Expression and abundance of unigenes

In order to estimate the abundance (a proxy of expression) of each unigene in each sample, reads were mapped against the gene catalog. Unigene expression values and genomic occurences are computed in RPKM. Then the abundance of each unigene was normalized and formulated in two different ways.

Reads Per Kilobase of transcript, per Million mapped reads (RPKM) is a normalized unit of transcript expression. It scales by transcript length to compensate for the fact that most RNA-seq protocols will generate more sequencing reads from longer RNA molecules


The global ocean transcript catalog reported here represents a first resource to study extensively and uniformly the gene content of eukaryotes and the dynamics of their expression in the environment. Metatranscriptomics is an effective approach to observe functionnal ecology of planktonic communities in the ocean.