Computational prospecting the great viral unknown

FEMS Microbiology Letters, 363, 2016, fnw077 doi: 10.1093/femsle/fnw077 Advance Access Publication Date: 29 March 2016 Minireview

MINIREVIEW – Virology

Bonnie L. Hurwitz∗, Jana M. U’Ren and Ken Youens-Clark

Department of Agricultural and Biosystems Engineering, University of Arizona, Tucson, AZ 85721, USA

∗Corresponding author: Department of Agricultural and Biosystems Engineering, University of Arizona, 1177 E. 4th Street, Tucson, AZ 85721, USA. Tel: 5206269819; E-mail: [email protected]

One sentence summary: A review of bioinformatics methods for analyzing viromes towards understanding viral ecology, evolution and the impact of viruses on host-driven biological processes.

Editor: Andrew Millard

ABSTRACT

Bacteriophages play an important role in host-driven biological processes by controlling bacterial population size, horizontally transferring genes between hosts and expressing host-derived genes to alter host metabolism. Metagenomics provides the genetic basis for understanding the interplay between uncultured bacteria, their phage and the environment. In particular, viral metagenomes (viromes) are providing new insight into phage-encoded host genes (i.e. auxiliary metabolic genes; AMGs) that reprogram host metabolism during infection. Yet, despite deep sequencing efforts of viral communities, the majority of sequences have no match to known proteins. Reference-independent computational techniques, such as protein clustering, contig spectra and ecological profiling, are overcoming these barriers to examine both the known and unknown components of viromes. As the field of viral metagenomics progresses, a critical assessment of tools is required as the majority of algorithms have been developed for analyzing bacteria. The aim of this paper is to offer an overview of current computational methodologies for virome analysis and to provide an example of reference-independent approaches using human skin viromes. Additionally, we present methods to carefully distinguish AMGs from host contamination. Despite computational challenges, these new methods offer novel insights into the diversity and functional roles of phages in diverse environments.

Keywords: virus; phage; bacteriophage; metagenomics; bioinformatics; virome

INTRODUCTION

Metagenomics is revolutionizing our understanding of microbial communities by providing insight into the genetic diversity of uncultured microorganisms. When coupled with next-generation sequencing (NGS), abundant sequence data can be obtained that represent both dominant and rare species in microbial assemblages. These advances have been fundamental to studying phages (dsDNA viruses that infect bacteria) as previous efforts to infect bacterial hosts and isolate viral plaques for genome sequencing have been limited by the small fraction of bacterial hosts that can be cultivated (∼1%; Amann, Ludwig and Schleifer 1995). Viral metagenomic surveys (hereafter referred to as viromes) are uncovering a wealth of novel viral genes that illustrate that phages represent the most genetically diverse and uncharacterized entity on Earth (Edwards and Rohwer 2005). Moreover, studies into the functional capacity of viromes are beginning to elucidate viral-host networks, indicating that phages may play an important role in ecosystem function beyond predator-prey interactions (Breitbart 2012; Hurwitz, Hallam and Sullivan 2013; Hurwitz, Brum and Sullivan 2015; Brum et al. 2016). Metagenomic data have also allowed large-scale analyses of viral host range, which illustrate that at large spatial scales phage communities have highly modular interaction networks, whereby groups of phages specialize on distinct groups of hosts (Lima-Mendez et al. 2015; see also Flores, Valverde and Weitz 2013; Weitz et al. 2013). Yet, as we embark on computationally prospecting the great viral unknown, we must take into account the strengths and limitations of current computational analyses of viromes.

Received: 21 January 2016; Accepted: 27 March 2016
© FEMS 2016. All rights reserved. For permissions, please e-mail: [email protected]

1 Downloaded from https://academic.oup.com/femsle/article-abstract/363/10/fnw077/2197653/Computational-prospecting-the-great-viral-unknown by guest on 09 September 2017


FEMS Microbiology Letters, 2016, Vol. 363, No. 10

[Figure 1 flowchart: An Overview of Viral Metagenomic Analyses. Panels: Phase 1: Sample Preparation; Phase 2: Data Preparation; Phase 3: Data Analysis.]

Figure 1. An overview of viral metagenomic analyses. Recommended methods for preparing samples to ensure high-quality viromes for taxonomic, functional and comparative metagenomic analyses. Data preparation steps in Phase 2 are essential to removing host contamination to discover novel auxiliary metabolic genes (AMGs).

In this review, we provide a detailed description of methods for viral metagenomic analyses, with a focus on dsDNA viruses. We describe these methods in detail for: (i) sequence quality control; (ii) identification of host and laboratory contaminants; (iii) taxonomic and functional analyses against reference databases; and (iv) emerging reference-independent approaches for comparative viromics. Each of these analyses is examined in the context of the computational capacity and integrity of new and standard algorithms for studying viruses using large-scale NGS viromes. Further, we outline a series of analyses to robustly distinguish viral-encoded host metabolic genes (auxiliary metabolic genes; AMGs) from contaminating prophage in bacterial genomes and bacterial gene transfer agents (GTAs; reviewed in Lang, Zhaxybayeva and Beatty 2012). Lastly, we compare and contrast reference-based versus reference-independent approaches by analyzing 32 paired viromes and microbiomes from a single individual across eight body sites at two time points from a recent survey of the skin microbiome (Hannigan et al. 2015). Based on these computational methods, we recommend a path forward for a complete comparative analysis of viromes towards understanding the interaction between phages and their hosts in diverse ecosystems.

THE DATA ANALYSIS CONTINUUM FOR VIRAL METAGENOMICS

Methods for distinguishing prokaryotic and eukaryotic microbial species and estimating their diversity in environmental samples are primarily based on sequencing of the small subunit ribosomal RNA gene (i.e. 16S or 18S rRNA; reviewed in Franzosa et al. 2015). However, in contrast to their bacterial hosts, phages lack a universal gene marker to estimate diversity and taxonomy (Edwards and Rohwer 2005). Instead, DNA is extracted from viral-like particles (VLPs) and whole metagenome shotgun sequencing approaches are used to produce viromes that represent the complete genetic composition of viral communities (Thurber et al. 2009). When the reads from viromes are compared to known proteins in public databases, upwards of 90% of reads remain unassigned and join the ranks of ‘viral dark matter’ (Minot et al. 2011, 2013; Hurwitz and Sullivan 2013; Hannigan et al. 2015). New reference-independent approaches are emerging that allow for a complete comparison of viromes, despite the fact that a large fraction of sequences are unknown (Angly et al. 2005; Hurwitz et al. 2014). Here, we provide a detailed overview of the complete data analysis continuum for dsDNA viruses, with the goal of advancing viral ecological, evolutionary and functional analyses (see Fig. 1).

Phase 1: Sample Preparation

Generating viromes

The methodology of creating dsDNA viromes is well established and a quantitative ocean virome ‘sample-to-sequence’ workflow has been described in detail by Solonenko and Sullivan (2013). Briefly, samples are first pre-filtered to remove cellular debris. VLPs are then concentrated using FeCl3 precipitation and purified (e.g. DNase alone, DNase + cesium chloride density gradient ultracentrifugation or DNase + sucrose density gradient ultracentrifugation; Hurwitz et al. 2013; Poulos 2015a,b,c; Brum 2016; Rohwer 2016). Following purification, viral DNA is extracted (Rich 2016) and fragmented to produce appropriately sized fragments for the specific NGS platform used (e.g. 150–300 bp, but maximum insert size for Illumina paired-end sequencing is 800 bp; see Illumina Paired End Sample Prep Guide, Rev. E., February 2011). Different fragmentation methods have various advantages and pitfalls (as discussed in Solonenko and


Sullivan 2013). Ideally the chosen method will yield fragments of the desired size while minimizing sample loss and the introduction of bias. Once fragmented, ends are repaired and adaptors specific to the NGS platform are added to the genomic DNA fragments (see details in Solonenko and Sullivan 2013). The library is then size-selected and amplified for sequencing (Duhaime, Deng and Poulos 2012; Poulos, Duhaime and Deng 2016) (see Fig. 1). Following amplification and DNA quantification, viromes are ready for sequencing using NGS technologies (e.g. Illumina, Ion Torrent or 454). Sample quality measures should be taken at each of the above steps to ensure high-quality viromes (e.g. SYBR staining of VLPs to assess purity; see Thurber et al. 2009; Solonenko and Sullivan 2013). In addition, to assess the potential for host contamination during sample preparation, a small aliquot of the DNA can be PCR amplified using universal 16S/18S rRNA primers and visualized on an agarose gel (Solonenko 2016). The presence of bands indicates significant host contamination, whereas the absence of bands suggests minimal host contamination (see also Salter et al. (2014) for discussion of laboratory contamination).

Constructing unbiased viromes

The resulting reads in viromes theoretically represent the relative abundance and genetic content of viruses in the community sample. Yet, when viral DNA yields are low it is often necessary to amplify viral DNA, which can introduce biases in the viral community composition. Specifically, amplification of raw genomic DNA with multiple displacement amplification (MDA) has been found to preferentially amplify small circular ssDNA viruses and to generate chimeras (Yilmaz, Allgaier and Hugenholtz 2010). Due to these limitations, MDA-amplified viromes may not accurately quantify viral genes or species from community samples. In contrast, linker-amplification approaches have been shown to produce unbiased representations of viral communities using 454 and Illumina NGS (Duhaime, Deng and Poulos 2012; Hurwitz et al. 2013), but are sensitive to enzyme choice for PCR amplification (Hurwitz et al. 2013). Specifically, the Taq DNA polymerase TaKaRa enzyme amplifies a higher fraction of rare reads in a virome (Hurwitz et al. 2013). For very low-input DNA samples, linear amplification for deep sequencing is an alternative to PCR amplification for amplifying library DNA (Hoeijmakers et al. 2011). Recently, standard kits such as Illumina NEBNext and Nextera kits have been successfully applied with low quantities of DNA required as input (see Brum et al. 2015). Overall, careful selection of viral amplification methods and reagents is essential to ensure that viral abundance, diversity and functional analyses are not biased due to methodological artifacts (as previously reviewed in Duhaime and Sullivan 2012; Solonenko and Sullivan 2013).

Phase 2: Data Preparation

Read quality control

DNA from multiple samples can be multiplexed as a single ‘run’ on an NGS machine using barcodes unique to each sample. Different NGS platforms (e.g. Ion Torrent, Illumina MiSeq, Roche 454) yield reads of different lengths, paired- versus single-end reads and variable error rates, such that researchers need to address specific issues related to their platform of choice during data preparation and analysis (see Lam et al. 2012; Loman et al. 2012). However, one critical aspect of each study is adequate sequencing coverage for each sample. For example, one million 100 bp reads per sample would yield 25x coverage, assuming ∼10^1 viral genomes per sample with an average viral


genome size of 0.4 Mb. However, coverage can vary based on sample diversity and abundance. To ensure adequate depth for complex communities, a study may choose to utilize multiple sequencing runs or lanes. Because quality can vary between sequencing runs, each run should be separately assessed to determine appropriate quality control cutoffs (as depicted in Fig. 1). Replicating a subset of samples across runs can be used to assess the degree of run-to-run variation. Quality control steps typically include: (i) visualizing raw read data for each run to determine downstream quality control parameters (e.g. FastQC (http://www.bioinformatics.babraham.ac.uk/projects/) or FASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/index.html)); (ii) separating individual viromes by barcode (i.e. demultiplexing); (iii) trimming low quality bases from the beginning and/or ends (e.g. Q < 20); (iv) removing short reads (e.g. reads 1; see Edgar and Flyvbjerg 2015); and (vi) removing technical replicates or duplicated reads (Hoff 2009). Additional quality control steps may be required to account for other platform-specific errors (Thomas, Gilbert and Meyer 2015). Conversely, some measures may be less relevant depending on downstream analyses. For example, removing short reads is useful prior to assembly or annotation steps, but their removal is not a requirement for reference-free analyses (described below).

Removing rare reads

The remaining high quality reads can then be compared to reads in the same virome to detect rare reads (i.e. singletons) (Hurwitz et al. 2013). These rare reads often have fewer matches to known proteins than their more abundant counterparts, and may represent sequencing errors or contamination (Hurwitz et al. 2014). Moreover, rare reads can increase complexity in de Bruijn graphs, leading to difficulties in generating accurate contigs during assembly (Pell et al. 2012). Rare reads can also impact rarefaction analysis by inflating the estimated protein richness of a virome (Hurwitz et al. 2013; Hurwitz and Sullivan 2013) and should be removed prior to follow-up analyses. To address this issue, automated methods such as digiNorm have been developed to enable digital normalization of reads to remove error (Howe et al. 2014; Crusoe et al. 2015).

Removing host contaminants

Although VLPs and viral DNA are purified through sample preparation, a small amount of host contamination can still be present. These contaminants can be derived from GTAs that encapsulate random fragments of a host genome and co-purify with VLPs (Lang, Zhaxybayeva and Beatty 2012), host DNA that was not removed during purification (see Roux et al. 2013), or sporadic laboratory contamination during sample preparation. Multiple lines of evidence from taxonomic and functional analyses can be applied toward identifying bacterial contaminants (Fig. 2). Prior to data analysis, reads from viromes should be compared to bacterial reference genomes and 16S rRNA databases to determine the extent of host contamination (Hurwitz, Hallam and Sullivan 2013) and the suitability of the virome for downstream analyses (see detailed methodologies in DISTINGUISHING BACTERIAL FROM VIRAL SIGNAL IN VIROMES).
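The coverage estimate and several of the read-level quality control steps in this phase can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the implementation of any tool cited here: the function names are hypothetical, reads are assumed to be already parsed into (sequence, Phred score list) pairs, and the `min_len` cutoff is a placeholder rather than a recommended value.

```python
def expected_coverage(n_reads, read_len_bp, n_genomes, mean_genome_bp):
    """Average fold coverage if reads were spread evenly over all genomes."""
    return (n_reads * read_len_bp) / (n_genomes * mean_genome_bp)


def trim_ends(seq, quals, min_q=20):
    """Trim bases with quality below min_q from both ends of a read."""
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end], quals[start:end]


def clean_reads(reads, min_q=20, min_len=50):
    """Trim ends, drop short reads, and drop exact duplicate sequences.

    reads: iterable of (sequence, list of Phred scores).
    min_len is a hypothetical cutoff chosen only for illustration.
    """
    seen = set()
    kept = []
    for seq, quals in reads:
        seq, quals = trim_ends(seq, quals, min_q)
        if len(seq) < min_len:
            continue  # too short after trimming
        if seq in seen:
            continue  # exact duplicate of a read already kept
        seen.add(seq)
        kept.append((seq, quals))
    return kept
```

With the numbers used in the text, `expected_coverage(1_000_000, 100, 10, 400_000)` gives 25-fold coverage. Real pipelines would typically read FASTQ files and handle paired-end reads, which this sketch leaves out.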

Phase 3: Data Analysis

Determining phage taxonomy and function

To date, only 1600 phage reference genomes, representing a paucity of bacterial hosts and diversity, exist in NCBI’s




[Figure 2 flowchart: Distinguishing Viral From Bacterial Signal. Inputs: BLASTx to all protein DB; BLASTn to 16S rRNA ref DB. Steps: Taxonomic Analysis; Functional Analysis; Recruitment Plot; Genome Context. Outcomes: Viral Encoded Host Genes; Prophage; Bacterial GTA; Bacterial Laboratory Contamination.]

Figure 2. Distinguishing viral from bacterial signal. A subset of sequences in viromes will match bacteria but be of viral origin, as in the case of host-genes encoded in viral genomes (AMGs) and prophage in bacterial genomes. This workflow describes how legitimate matches to bacteria can be distinguished from host contamination derived from sporadic laboratory contamination or gene transfer agents (GTAs).

database of viral genomes (6355 total viral genomes) as of December 2015. Specifically, 45% of phage genomes are derived from just four clinically relevant genera of bacteria (Mycobacterium, Enterobacteria, Pseudomonas and Staphylococcus). In addition to whole genomes, nucleotide and protein reference databases such as NCBI non-redundant nucleotides (nt) or UniParc (UniProt Consortium 2015) can be used to annotate viral reads. Compiled resources of all predicted viral proteins are also available through UniProtKB/TrEMBL, with the added advantage that these data link back to protein databases and curated resources (e.g. Gene Ontology). Although these databases are useful in illuminating viral ‘dark matter’ where known proteins exist, the computational search time can be intractable, especially given the size of reference databases and number of reads in modern NGS datasets (see COMPUTATIONAL COMPLEXITY AND ALGORITHM INTEGRITY below). As a result, annotations are often performed on a reduced dataset (e.g. open reading frames (ORFs) from assembled contigs) rather than on individual reads. However, sequences with no close matches in databases have a greater potential for misalignment and inaccurate taxonomic assignment, which can result in biased taxonomic and functional analyses of viromes if search parameters are not stringent.

Assessing protein diversity in viromes through protein clustering

Given that the majority of reads and predicted proteins from viromes do not match known proteins, recent efforts have been made to develop protein clusters for comparative analyses (Yooseph et al. 2007; Hurwitz and Sullivan 2013; Brum et al. 2015). Although most clusters lack annotation, they can be used



as a means to organize and compare unknown proteins across viromes and help define unknown proteins specific to certain ecosystems or environments. Further, rarefaction analyses of protein clusters can be used to compare virome protein richness (Hurwitz and Sullivan 2013). To create protein clusters, viromes are assembled individually, ORFs are detected on the contigs and subsequently clustered with existing viral protein clusters. ORFs that are not clustered are then self-clustered to create new protein clusters for annotation and analysis as implemented in CyVerse (Goff et al. 2011; http://www.cyverse.org/) in the PCPipe App through the iVirus project (Hurwitz 2014). Reads from the original viromes can be recruited back to protein clusters, a process called ‘fragment recruitment’, to address questions related to taxonomic diversity or biochemical richness of gene families in viral communities, as well as evolutionary questions related to the environment. For example, protein clusters can be used to find the ‘core’ versus ‘flexible’ component of viromes from diverse ecosystems (Hurwitz and Sullivan 2013; Brum et al. 2015; Hannigan et al. 2015; Hurwitz, Brum and Sullivan 2015). The caveats to these analyses are that (i) as few as 30% of reads may be contained in contigs (Brum et al. 2015) and (ii) accurate ORF calls are dependent on high-quality assemblies. The latter is a serious concern given that most assembly and gene calling algorithms are optimized for bacteria, not viruses (see COMPUTATIONAL COMPLEXITY AND ALGORITHM INTEGRITY FOR VIROMES below).

Determining viral diversity through contig spectra

Viral diversity can be calculated in a reference-independent manner by calculating the amount of contig assembly in a contig spectrum through the PHACCS toolkit (Angly et al. 2005). Specifically, the contig spectrum from a virome is compared to simulated viral communities that vary in size and diversity until an approximate match is found. These data can then be employed in downstream alpha- and beta-diversity metrics, but because these analyses require high-quality assembly they are limited to a fraction of the total read data (see COMPUTATIONAL COMPLEXITY AND ALGORITHM INTEGRITY FOR VIROMES below).

Ecological profiling for comparative viral metagenomics

Recently, ecological profiling techniques have been employed in viromes to visualize the amount of ‘shared sequence space’ (weighted by environmental factors) using Bayesian network analyses (Hurwitz et al. 2014; Brum et al. 2016). To employ these methods, the relative distance between viromes must be calculated. Given smaller 454 datasets, an all-vs-all sequence analysis between viromes is possible (Hurwitz et al. 2014). Yet, given the size of modern NGS datasets (>1M reads), computing all-vs-all sequence analyses is not feasible with alignment algorithms such as BLAST or even fast k-mer composition based algorithms such as Jellyfish (Altschul et al. 1997; Marçais and Kingsford 2011). One workaround currently implemented in the CyVerse cyberinfrastructure iVirus project (Goff et al. 2011; http://www.cyverse.org/) is the Fizkin App (Youens-Clark 2016). This App randomly subsets 300K reads from each virome, compares reads in a pairwise fashion using Jellyfish, and visualizes the data using a Bayesian network graph with a maximum of 15 viromes at a time. This approach infers relationships between viral communities and environmental factors without requiring assembly or annotation, but it is limited by sample size (Hurwitz et al. 2014; Brum et al. 2016). New methods such as Mash (Ondov et al. 2015) that use unique ‘sketches’ to rapidly compare the distance between metagenomes may provide another promising path forward. With the availability of large-scale viromes


from entire ecosystems, such big data approaches for comparative metagenomics are needed (Council 2010; Karsenti et al. 2011; Brum et al. 2015).
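The idea of comparing viromes by shared k-mer content after random subsampling can be sketched in a few lines. The code below is a simplified stand-in, not the Fizkin or Mash algorithm: it computes a plain Jaccard distance on exact k-mer sets held in memory, and the function names, the default `k` and the `n_sub` parameter are assumptions made for illustration.

```python
import random
from itertools import combinations


def kmers(seq, k):
    """Set of all k-mers in one read."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_set(reads, k, n_sub=None, seed=0):
    """Union of k-mers over an optional random subsample of reads."""
    reads = list(reads)
    if n_sub is not None and len(reads) > n_sub:
        reads = random.Random(seed).sample(reads, n_sub)
    out = set()
    for read in reads:
        out |= kmers(read, k)
    return out


def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B|; defined as 0.0 for two empty sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def pairwise_distances(viromes, k=20, n_sub=None, seed=0):
    """viromes: dict mapping virome name -> list of read sequences."""
    sets = {name: kmer_set(reads, k, n_sub, seed)
            for name, reads in viromes.items()}
    return {(x, y): jaccard_distance(sets[x], sets[y])
            for x, y in combinations(sorted(sets), 2)}
```

The resulting distance table could feed a clustering or network visualization. Holding exact k-mer sets in memory would not scale to full NGS datasets, which is the motivation for subsampling and for sketch-based methods.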

COMPUTATIONAL COMPLEXITY AND ALGORITHM INTEGRITY FOR VIROMES

Understanding the advantages and caveats of metagenomic algorithms for viral ecology is fundamental to interpreting matches to known genomes and better defining ‘viral dark matter’. This section focuses on understanding the limitations of alignment-based algorithms and recent advances in k-mer-based read assignment to known genomes for applications in viral ecology.

Algorithms for sequence classification in viromes

Alignment-based algorithms that use sequence similarity to classify reads (e.g. BLAST; Altschul et al. 1997) are commonly used to find sequence matches between viromes and known reference databases. BLAST can either be run independently, or as part of a metagenomic software package such as MG-RAST (Glass et al. 2010), MetaPhyler (Liu et al. 2011), or CARMA (Gerlach et al. 2009). BLAST results also can be incorporated into MEGAN for taxonomic classification (Huson et al. 2011). Alignment-based approaches have high accuracy but extensive compute times that are up to 56x slower than composition-based algorithms that use k-mers (Hurwitz et al. 2014). This is because alignment-based methods require large reference databases to maximize the number of annotations obtained for comprehensive virome comparisons. Other alignment-based approaches such as hidden Markov models (e.g. HMMER; Finn, Clements and Eddy 2011) may also be used to find distant matches to Pfam or KEGG protein domains, but these methods are also compute intensive. Thus, although alignment-based approaches excel in accuracy, they fail to scale to the size of modern viromes. Moreover, even when reads from viromes have been aligned to reference proteins, often less than 10% of reads have matches (Hurwitz and Sullivan 2013).

Recently, fast k-mer based algorithms have been used to classify metagenomic reads against known bacterial genomes at remarkable throughput and speed. Specifically, CLARK (CLAssifier based on Reduced K-mers) (Ounit et al. 2015), USEARCH (Edgar 2010), KRAKEN (Wood and Salzberg 2014) and NBC (Naive Bayes Classifier) (Rosen et al. 2008) offer composition-based approaches to quickly identify microbial species present in a metagenome. In each case, frequency profiles of k-mers from microbial genomes are built to rapidly assign reads to genomes in a reference database (Rosen, Reichenberger and Rosenfeld 2011; Bazinet and Cummings 2012; Wood and Salzberg 2014; Ounit et al. 2015). Although these methods are exceptionally fast and outperform alignment-based methods in speed, building and using the frequency profile for the reference database requires large amounts of memory (>128 GB of RAM) (Ounit et al. 2015). As a result, the database for taxonomic classification is typically constrained to reference genomes only (e.g. NCBI’s RefSeq). Given the limited representation of viral genomes in RefSeq (described above), these methods produce only a small fraction of annotation for viromes. To examine this, we compared reads from the Pacific Ocean Virome dataset that was previously annotated to all known proteins at the genus and family level using BLASTx (Hurwitz and Sullivan 2013) to taxonomic data resulting from CLARK (Ounit et al. 2015) (Fig. 3). Overall, we found that 0.96% and 1.12% of reads matched regions in viral and




[Figure 3: paired bar charts for 32 Pacific Ocean viromes. Panels: CLARK (97% unidentified); BLASTx (89% unidentified). x-axis: Identified Reads (%). Legend: Myoviridae, Podoviridae, Siphoviridae, Phycodnaviridae, Other Viral Family, Bacterial.]

Figure 3. Comparison of alignment-based (BLASTx) versus k-mer-based (CLARK (Ounit et al. 2015)) taxonomic analysis for 32 viromes from the Pacific Ocean. Data are from the original paper (Hurwitz and Sullivan 2013).

bacterial genomes, respectively, using CLARK, as compared to 6.87% and 4.01% for all viral and bacterial proteins, respectively, using BLASTx. Thus, mapping reads to a database containing all proteins increased high-quality annotations by nearly 9 percentage points (10.88% versus 2.08% of reads) compared to reference genomes alone. These results demonstrate that the completeness of the underlying database is vital for illuminating ‘viral dark matter’.

In addition to taxonomic identification, recent approaches have been developed for rapid annotation of short reads against protein reference databases that are orders of magnitude faster than BLASTx, including USEARCH, PAUDA, RAPSearch2, UProC, RAST and DIAMOND (Aziz et al. 2008; Edgar 2010; Zhao, Tang and Ye 2012; Huson and Xie 2014; Buchfink, Xie and Huson 2015; Meinicke 2015). These fast protein searches either (i) use indexes or suffix arrays for complete k-mer matching of a query compared to a reference database or (ii) match unique protein domains to find protein families (Aziz et al. 2008; Meinicke 2015). In either case, these algorithms are often constrained by system memory. For example, DIAMOND requires ∼1.6 TB of memory for the NCBI-nr database (∼100 GB; Buchfink, Xie and Huson 2015) and UProC requires 16 GB of memory for the Pfam database (∼4 GB; Meinicke 2015).
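To make the idea of k-mer-based read assignment concrete, the following toy sketch indexes k-mers that occur in exactly one reference and assigns each read to the reference with the most such hits. It is a deliberately simplified illustration in the spirit of discriminative k-mer classifiers, not the actual CLARK or KRAKEN implementation; the function names and the `min_hits` parameter are hypothetical.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Yield every k-mer of a sequence in order."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def build_index(references, k):
    """Map each k-mer found in exactly one reference to that reference.

    references: dict mapping reference name -> genome sequence.
    """
    owners = defaultdict(set)
    for name, seq in references.items():
        for km in kmers(seq, k):
            owners[km].add(name)
    return {km: next(iter(names))
            for km, names in owners.items() if len(names) == 1}


def classify(read, index, k, min_hits=1):
    """Return the best-supported reference, or None if unassigned or tied."""
    hits = Counter(index[km] for km in kmers(read, k) if km in index)
    if not hits:
        return None
    (best, n), *rest = hits.most_common(2)
    if n < min_hits or (rest and rest[0][1] == n):
        return None
    return best
```

A read from a virus absent from the reference set shares no indexed k-mers and comes back as `None`. The index also grows with the number of distinct k-mers in the references, which gives a sense of the memory pressure described for full-size databases.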

Algorithms for assembly and gene calling

One important, but frequently neglected consideration in analyzing viromes is that most metagenomic tools for whole genome shotgun analysis (including assembly and gene calling) were originally developed for bacteria (Noguchi, Park and Takagi 2006; Hyatt et al. 2010; Rho, Tang and Ye 2010; Bankevich et al. 2012; Boisvert et al. 2012; Kelley et al. 2012; Namiki et al. 2012; Peng et al. 2012). Unlike their hosts, viruses are under selective pressure to rapidly mutate and evade host detection (e.g. CRISPR systems), resulting in increased diversity in viromes (Minot et al. 2012). Because of the underlying diversity, assembly algorithms that use de Bruijn graphs are ideal for assembling viromes in two ways: (i) graphs provide a good representation of fragmentary read data from community viromes and (ii) assembly can be optimized using various k-mer sizes to account for underlying sequence diversity (Minot et al. 2012). Further, the resulting contigs can be validated based on concepts in viral biology such as conserved gene cassettes in diverse viruses (Minot et al. 2012). Also, although paired-end versus single-end NGS reads require different assembly methods, modern assembly programs such as IDBA-UD and SPAdes (Bankevich et al. 2012; Peng et al. 2012) have built-in functions to assemble different types of data.

Another consideration in the assembly of viromes is the varied repeat content among phages. This can complicate assembly by (i) creating branches in graphs that may lead to chimeras or (ii) fragmenting the assembly in conservative approaches, leading to the loss of species represented (Treangen and Salzberg 2012). Typically, strategies for resolving repetitive DNA include: (i) sequencing multiple libraries that contain fragments of varied sizes to span repeats (Wetzel, Kingsford and Pop 2011); (ii) using tools to detect misassembly (Phillippy, Schatz and Pop 2008); and (iii) using coverage to find nodes in a de Bruijn graph with high coverage. Each of these approaches uses mate-pair libraries to resolve misassemblies given that short reads alone do not span these problematic regions. In the future, long read sequencing technologies (e.g. PacBio) may help to generate better phage genome assemblies by sequencing 10–15 kb regions, and are already being applied to phage isolates and temperate phages in cultured hosts (Beims et al. 2015; Wittmann et al. 2015). Given contigs and a refined assembly, ORFs should then be detected using unsupervised approaches that are optimized for detecting genes in novel genomes (Ter-Hovhannisyan et al. 2008; Hyatt et al. 2010), rather than using gene prediction software that uses training models derived from bacterial gene sets.
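The effect of k-mer size on graph branching can be seen in a toy de Bruijn graph. The sketch below is a bare-bones illustration under simplifying assumptions (error-free reads, one strand, no coverage tracking) and is not how IDBA-UD or SPAdes are implemented; the function names are hypothetical.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges come from k-mers.

    Each k-mer in a read adds an edge from its prefix to its suffix.
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branching_nodes(graph):
    """Nodes with more than one successor, i.e. points of ambiguity."""
    return sorted(node for node, succ in graph.items() if len(succ) > 1)


# Two reads that share only the short sequence "GCT".
reads = ["AAGCTT", "TGGCTA"]
small_k = branching_nodes(de_bruijn(reads, 3))  # shared sequence causes a branch
large_k = branching_nodes(de_bruijn(reads, 5))  # branch disappears at larger k
```

With k = 3 the shared sequence merges the two reads at one node and creates a branch, whereas with k = 5 the two reads stay on separate paths. Larger k values reduce such ambiguity at the cost of requiring longer exact overlaps between reads, which is the trade-off behind trying several k-mer sizes.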

DISTINGUISHING BACTERIAL FROM VIRAL SIGNAL IN VIROMES

Given the molecular arms race between viruses and their hosts, determining the genetic origin of genes in viromes can be problematic. Viruses can encode host genes in the case of AMGs, and bacteria can encode viral DNA in the case of integrated prophages and CRISPR-Cas systems. Perhaps even more complex, entire bacterial genomes can be found in viromes when GTAs co-purify with VLPs. Distinguishing each of these components is vital to understanding the complexities of host-virus co-evolution and how viral communities change over space and time.

GTAs versus sporadic contamination

The 16S rRNA gene can be used as supporting evidence to infer the source of the ‘bacterial hits’ in a virome and determine the suitability for downstream ecological, evolutionary and functional analyses. Because viromes are composed of WGS data, reads can be searched against 16S rRNA databases to find reads that match any hypervariable region or component of the 16S rRNA gene from diverse species of bacteria. Given that viruses do not contain this gene in their genomes, these matches can be used as a metric of the relative amount of bacterial contamination. GTAs are suspected in a virome when there are abundant protein matches to one or a few bacteria in the same bacterial order as the 16S rRNA gene matches. The presence of GTAs can be further validated using a recruitment plot to map bacterial hits to a specific reference genome. If GTAs are present in the virome, hits will appear randomly across the entire bacterial reference genome (see Fig. 2) (Hynes et al. 2012). Further, matches to specific bacterial species or orders may provide additional evidence. To date, four genetically diverse GTAs have been discovered, in three bacterial species (Rhodobacter capsulatus (Rhodobacterales), Desulfovibrio desulfuricans (Deltaproteobacteria) and Brachyspira hyodysenteriae (Spirochaetales)) and one archaeon (Methanococcus voltae). Recent evidence from 16S rRNA gene fragments in viromes suggests that GTAs may co-purify with viral particles and may be derived from the following bacterial lineages: Actinobacteridae, Rhizobiales, Burkholderiales and Alteromonadales (Hurwitz, Hallam and Sullivan 2013).

Sporadic laboratory contamination, on the other hand, is often sparse and random (owing to DNase treatment to remove free cellular DNA, and assuming a negative 16S PCR result prior to sequencing to assess whether there is cellular contamination in the sample). As such, it is unlikely that hits to 16S rRNA from contamination will match to true bacterial proteins as in the case of AMGs.

Lastly, fast alignment approaches to reference genomes such as CLARK (Ounit et al. 2015) may provide a mechanism for identifying GTAs or sporadic contamination in viromes. For example, the virome L.Spr.C.1000m has abundant matches to bacteria (16%; Fig. 3), with the majority matching a single bacterial genus, Pseudoalteromonas (Alteromonadales) (12%), a pattern that was previously suggested to be from GTAs (Hurwitz, Hallam and Sullivan 2013). Similarly, the virome M.Fall.I.42m was found to have 19% matches to bacteria, with hits to a few dominant orders (Propionibacteriales, Burkholderiales and Rhizobiales; 15% total) that may also represent GTAs. Other fast database search algorithms such as kmerID, UCLUST or Kraken (Edgar 2010; Wood and Salzberg 2014; Ounit et al. 2015; https://github.com/phe-bioinformatics/kmerid) may also be used to rapidly search for read matches to just a few dominant bacterial reference genomes and could be useful in distinguishing GTAs, provided the contaminating genome is present in reference databases. Further, as noted above, a recruitment plot of GTAs will show matches randomly across the entire genome (Fig. 2).
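The screening logic described above (abundant bacterial protein matches dominated by one or a few orders, supported by 16S rRNA matches to the same order) can be sketched in a few lines of Python. This is a hedged illustration rather than a published procedure: the function name summarize_bacterial_signal, the input format (one order label per matching read) and both thresholds are assumptions chosen for the example.

```python
from collections import Counter


def summarize_bacterial_signal(total_reads, protein_hit_orders, rrna_hit_orders,
                               min_bacterial_fraction=0.10, min_dominance=0.5):
    """Summarize bacterial signal in a virome from per-read taxonomic assignments.

    protein_hit_orders: bacterial order for each read with a bacterial protein match
    rrna_hit_orders:    bacterial order for each read with a 16S rRNA match
    The two thresholds are illustrative assumptions, not published cut-offs.
    """
    bacterial_fraction = len(protein_hit_orders) / total_reads
    rrna_fraction = len(rrna_hit_orders) / total_reads

    counts = Counter(protein_hit_orders)
    if counts:
        top_order, top_count = counts.most_common(1)[0]
        dominance = top_count / len(protein_hit_orders)
    else:
        top_order, dominance = None, 0.0

    # GTA-like pattern: many bacterial hits, mostly to one order,
    # and the same order is seen among the 16S rRNA matches.
    gta_suspected = (
        bacterial_fraction >= min_bacterial_fraction
        and dominance >= min_dominance
        and top_order in set(rrna_hit_orders)
    )
    return {
        "bacterial_fraction": bacterial_fraction,
        "rrna_fraction": rrna_fraction,
        "top_order": top_order,
        "dominance": dominance,
        "gta_suspected": gta_suspected,
    }


if __name__ == "__main__":
    # Hypothetical virome: 100 reads, 16 bacterial protein hits, 2 16S hits.
    protein_hits = ["Alteromonadales"] * 12 + ["Rhizobiales"] * 4
    rrna_hits = ["Alteromonadales"] * 2
    print(summarize_bacterial_signal(100, protein_hits, rrna_hits))
```

A sample flagged in this way would still need confirmation, for example with a recruitment plot against the dominant reference genome.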


Finding prophages in bacterial genomes

Many phages live symbiotically with their hosts by integrating their genome into the host replicon. These phages are called temperate phages, or prophages when integrated, and may be most prevalent when host abundance is low (Paul 2008). When integrated in the host genome, the viral genome is replicated in daughter cells alongside the host genome and can encode functions that benefit both the host and phage. For example, prophages can encode virulence factors that enable bacterial attachment, invasion and survival within a eukaryotic host, often converting a non-pathogenic bacterial host to a virulent strain or increasing a pathogen’s virulence (e.g. bacteriophage CTX encodes the principal virulence factor (CT) of Vibrio cholerae; reviewed by Brüssow, Canchaya and Hardt 2004). Best described from human pathogens, virulence factors can encompass superantigens, pore-forming lysins, exotoxins, and/or effector proteins or effector toxins (Boyd, Carpenter and Chowdhury 2012), but their prevalence in environmental samples is less well documented. Given environmental cues, prophages can excise themselves from bacterial genomes and enter a lytic phase to replicate and kill their hosts.

Prophages can be detected in bacterial genomes by aligning virome reads that have hits to bacteria against bacterial reference genomes and using a ‘recruitment plot’ to identify specific regions of the bacterial genome with virome hits (Fig. 2). These genomic regions may also contain prophage-specific genes such as integrases, phage repressors or anti-terminators. However, some prophages may not contain these genes, and matches to the virome may represent prophage remnants (Paul 2008). Thus, the signature for a prophage is the recruitment of virome reads to a single region in a bacterial reference genome. These prophage reads should be reassigned as phage, rather than bacteria. Reference databases of documented prophage sequences and prophage-finding tools such as ACLAME, Prophinder, PHAST and PhiSpy can further improve the robustness of taxonomic annotation in viromes (Lima-Mendez et al. 2008; Leplae, Lima-Mendez and Toussaint 2010; Zhou et al. 2011; Akhter, Aziz and Edwards 2012).
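The contrast between the two recruitment patterns (hits scattered across the whole reference genome for GTAs, hits concentrated in a single region for prophages) can be made concrete with a small binning sketch. The Python below is an assumption-laden illustration, not a replacement for inspecting the recruitment plot: the function names, bin count, window size and both cut-offs are hypothetical.

```python
def bin_hits(positions, genome_length, n_bins=100):
    """Count read alignment start positions in equal-width bins along the genome."""
    counts = [0] * n_bins
    for pos in positions:
        idx = min(pos * n_bins // genome_length, n_bins - 1)
        counts[idx] += 1
    return counts


def recruitment_pattern(positions, genome_length, n_bins=100, window=3,
                        localized_share=0.8, dispersed_breadth=0.5):
    """Label a recruitment profile as 'localized', 'dispersed', 'ambiguous' or 'no hits'.

    localized_share:   share of all hits inside the best run of `window` bins
                       needed to call a prophage-like pattern (assumed cut-off)
    dispersed_breadth: fraction of bins with at least one hit needed to call
                       a GTA-like pattern (assumed cut-off)
    """
    counts = bin_hits(positions, genome_length, n_bins)
    total = sum(counts)
    if total == 0:
        return "no hits"
    window = min(window, n_bins)
    best_window = max(sum(counts[i:i + window]) for i in range(n_bins - window + 1))
    breadth = sum(1 for c in counts if c > 0) / n_bins
    if best_window / total >= localized_share:
        return "localized"   # consistent with a prophage region
    if breadth >= dispersed_breadth:
        return "dispersed"   # consistent with GTA-packaged host DNA
    return "ambiguous"


if __name__ == "__main__":
    genome_length = 1_000_000
    scattered = list(range(0, genome_length, 5_000))   # hits every 5 kb
    clustered = list(range(400_000, 420_000, 100))     # hits within one 20 kb region
    print(recruitment_pattern(scattered, genome_length))
    print(recruitment_pattern(clustered, genome_length))
```

Reads that fall in a region labeled as localized would be candidates for reassignment from bacteria to phage, ideally after checking the region for integrases or other prophage-specific genes.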

[Figure 4 image not reproduced. Panels (A) Virome and (B) Bacterial Metagenome: stacked bars of taxonomic relative abundance (%) for each skin site at time points 1 and 2. Legend taxa: Pseudomonas phage, Bacillus phage, Propionibacterium phage, Planktothrix phage, Corynebacterium phage, Rhodococcus phage, Staphylococcus phage and Other (virome); Propionibacterium, Staphylococcus, Streptococcus, Corynebacterium, Actinobacteria, Haemophilus, Rothia, Lactobacillus, Pseudomonas and Other (bacterial metagenome). Panels (C) and (D): NMDS1 versus NMDS2 ordinations; legend: occlusion status (Occluded, Intermittently Occluded, Exposed) and microenvironment (Sebaceous, Intermittently Moist, Moist).]

Figure 4. Reference-based taxonomic analysis and NMDS analysis of reference-independent PHACCS contigs for 16 viromes and 16 microbiomes from a healthy 27-year-old male subject across 8 body sites at two time points over a one-month period (based on data from Hannigan et al. 2015). Taxonomy is based on relative abundance of dominant taxa in the (A) virome and (B) bacterial metagenome. Non-metric multidimensional scaling (NMDS) using Bray–Curtis dissimilarity of normalized sequence counts of each contig in the (C) virome and (D) bacterial metagenome. Solid circles represent time point one, whereas open circles represent time point two. Different colors indicate microenvironment or occlusion status. The eight skin sites include: retroauricular crease (Ra), occiput (Oc), axilla (Ax), umbilicus (Um), forehead (Fh), antecubital fossa (Ac), palm (Pa) and toe web (Tw). See Hannigan et al. (2015) for additional methodological details.

Auxiliary metabolic genes

Phage genomes have been shown to contain AMGs that are expressed and may augment host metabolism during infection to facilitate viral production (reviewed in Rohwer and Thurber 2009; Breitbart 2012). In the marine environment, the most well-studied example is the core photosystem II gene (psbA) in cyanophage (Sullivan et al. 2006), which is expressed (Lindell et al. 2005; Clokie et al. 2006) and bolsters host energy production during infection. Recent work on marine viromes has produced a catalog of potential AMGs with implications in host-driven carbon cycling and iron-sulfur cluster modulation for phage production (Hurwitz, Hallam and Sullivan 2013; Hurwitz, Brum and Sullivan 2015). Further, viromes in the Pacific Ocean have been shown to contain AMGs that are niche-specific (Hurwitz, Brum and Sullivan 2015). In particular, aphotic viromes contain AMGs for DNA replication initiation (diaA and dnaA) that are essential for deep sea survival in a high pressure environment. Thus, studying the co-evolution of phage and their bacterial hosts through detecting AMGs can have important implications for ecosystem function (see also Roux et al. 2015a,b). Detecting AMGs in viromes, however, requires a careful analysis of viromes for potential bacterial contamination (see Roux et al. 2013). Moreover, potential AMGs should be validated by looking at the genome context of AMGs: ‘real’ AMGs will be embedded in assembled contigs with other known viral genes. Given the limited number of phage genomes, finding AMGs in context with known phage genes can be difficult. This process would be greatly improved by the development of a resource containing proposed phage AMGs based on their representation in diverse and varied viromes, making it less likely that these are contaminants (Hurwitz and Sullivan 2013; Brum et al. 2015).
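The genome-context check for AMGs can be sketched as follows: given the ordered gene annotations on an assembled contig, a candidate AMG is considered supported only when known viral genes occur in its neighborhood. This Python sketch is a minimal illustration under stated assumptions; the function name amg_context_support, the set of viral hallmark labels, the flank size and the requirement for viral genes on both sides are all hypothetical choices, not criteria taken from the studies cited above.

```python
# Hypothetical labels for genes treated as viral hallmarks in this example.
VIRAL_HALLMARKS = {"terminase", "major capsid protein", "portal protein", "tail fiber"}


def amg_context_support(contig_genes, amg_index, viral_hallmarks=VIRAL_HALLMARKS, flank=5):
    """Check whether a candidate AMG is flanked by known viral genes.

    contig_genes: ordered annotation labels for the ORFs on one contig
    amg_index:    position of the candidate AMG in contig_genes
    flank:        number of neighboring ORFs inspected on each side (assumed value)
    """
    start = max(0, amg_index - flank)
    upstream = contig_genes[start:amg_index]
    downstream = contig_genes[amg_index + 1:amg_index + 1 + flank]
    upstream_viral = sum(gene in viral_hallmarks for gene in upstream)
    downstream_viral = sum(gene in viral_hallmarks for gene in downstream)
    return {
        "upstream_viral": upstream_viral,
        "downstream_viral": downstream_viral,
        # Assumption: require viral genes on both sides to reduce the chance
        # that the candidate sits at the edge of a contaminating host fragment.
        "supported": upstream_viral > 0 and downstream_viral > 0,
    }


if __name__ == "__main__":
    viral_contig = ["terminase", "hypothetical protein", "psbA", "major capsid protein"]
    host_like_contig = ["ribosomal protein", "psbA", "dnaK"]
    print(amg_context_support(viral_contig, 2))
    print(amg_context_support(host_like_contig, 1))
```

Because many phage genes have no annotation, a lack of support in such a check does not rule out a viral origin; it only marks the candidate as unconfirmed.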

A CASE STUDY IN VIRAL ECOLOGY FROM THE HUMAN SKIN VIROME

Methods in viral ecology are constantly evolving, making it difficult to compare methods across studies. Here, we present a comparison of reference-based taxonomic analyses versus two reference-independent methods using a subset of data from a recent study of the human skin virome (Hannigan et al. 2015). Given significant interpersonal variation among subjects in the skin virome and microbiome, we reanalyzed the data for a single individual (Figs 4 and 5). Specifically, the subset of data consisted of 32 samples (16 viromes and their 16 paired microbiomes) from a healthy 27-year-old male at eight body sites sampled at two time points over a one-month period. Analyses in the original study consisted of (i) a reference-dependent taxonomic analysis where ORFs in contigs were compared to proteins in the UniProt reference database (UniProt Consortium 2015) (Fig. 4A and B) and (ii) a reference-independent approach to estimate diversity using contigs derived from the PHACCS toolkit (Angly et al. 2005). We use the PHACCS data from the original paper to estimate spatial and temporal turnover in the viromes and microbiomes of a single subject using non-metric multidimensional scaling (NMDS) based on Bray–Curtis dissimilarity of normalized sequence counts of each contig (Fig. 4C and D). We extend these analyses by including a reference-independent Bayesian network analysis using k-mers, as previously described (Hurwitz et al. 2014) (Fig. 5).
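As a worked illustration of the input to the NMDS analysis, the sketch below computes Bray–Curtis dissimilarity between samples from per-contig sequence counts that are first normalized to relative abundance. It uses only the Python standard library and hypothetical sample names and counts; the ordination step itself is not shown and would be done by passing the resulting matrix to a non-metric MDS routine.

```python
def normalize(counts):
    """Convert raw per-contig counts to relative abundances that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity: sum|u_i - v_i| / sum(u_i + v_i)."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    return numerator / denominator if denominator else 0.0


def dissimilarity_matrix(samples):
    """Pairwise Bray-Curtis dissimilarities between normalized samples."""
    names = sorted(samples)
    profiles = {name: normalize(samples[name]) for name in names}
    return {a: {b: bray_curtis(profiles[a], profiles[b]) for b in names} for a in names}


if __name__ == "__main__":
    # Hypothetical counts of reads assigned to four contigs in three samples.
    samples = {
        "Fh_t1": [10, 30, 0, 0],
        "Fh_t2": [30, 10, 0, 0],
        "Tw_t1": [0, 0, 25, 15],
    }
    for name, row in dissimilarity_matrix(samples).items():
        print(name, {other: round(value, 3) for other, value in row.items()})
```

Values range from 0 (identical relative abundances) to 1 (no contigs shared), so samples with no contigs in common sit at the maximum distance regardless of sequencing depth.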


[Figure 5 image not reproduced. Panels (A) to (D): network layouts of skin sites (Ra, Oc, Ax, Um, Fh, Ac, Pa, Tw) on unlabeled numeric axes; legend: occlusion status (Occluded, Intermittently Occluded, Exposed) and microenvironment (Sebaceous, Intermittently Moist, Moist).]

Figure 5. Bayesian network analysis of 16 viromes and 16 microbiomes from a healthy 27-year-old male subject across 8 body sites at two time points over a one-month period. (A) The skin virome at time point 1; (B) the skin virome at time point 2, a month later; (C) the skin microbiome at time point 1; (D) the skin microbiome at time point 2, a month later. The eight skin sites include: retroauricular crease (Ra), occiput (Oc), axilla (Ax), umbilicus (Um), forehead (Fh), antecubital fossa (Ac), palm (Pa) and toe web (Tw). Colored text indicates the microenvironment classification, and each colored ball represents the occlusion status of the body site as per the original paper (Hannigan et al. 2015). Bayesian network analyses were performed as previously described (Hurwitz et al. 2014) with 300k randomly selected reads using the Fizkin App at CyVerse (Youens-Clark 2016).

Given these data, we address how different informatics approaches compare in determining (i) the biogeography and diversity of the skin virome as compared to the microbiome and (ii) how viral and bacterial communities change at various skin sites over the course of a month. Analysis of the complete dataset (512 paired viromes and microbiomes from 16 subjects across 8 body sites over a one-month period) revealed that the composition of viral and bacterial communities is strongly associated with microenvironment (sebaceous, intermittently moist and moist) and skin occlusion status (occluded, intermittently occluded and non-occluded) (Hannigan et al. 2015). The authors also note that community composition of the skin microbiome remained consistent over time for each body site, whereas the virome varied significantly. These results suggest that viral communities are more dynamic than bacterial communities, but both are structured based on characteristics associated with the skin sites.

Reanalysis of the data for a single subject using two reference-independent methods recapitulates that viromes vary across sites and over time (Figs 4C, 5A and B), whereas the microbiome remains relatively stable (Figs 4D, 5C and D). By comparison, the reference-based taxonomic analysis of the single subject (Fig. 4A and B) revealed no clear pattern in viral or microbial community structure related to microenvironment and/or occlusion status. In this subject, differences in community structure appear more closely related to specific body sites than to microenvironment or occlusion status. In the bacterial microbiome, the network analysis reveals that the Tw (toe web) and Ax (axilla) communities differ from other body sites (Fig. 5C and D), potentially due to the enrichment of Staphylococcus in the Tw and Ax as compared to other body sites (Fig. 4B). Interestingly, despite differences in the relative abundance of dominant bacterial taxa compared to other sites (as revealed by reference-based methods), the Um (umbilicus) is clustered with other skin sites in the reference-independent network analysis and NMDS (Figs 4D, 5C and D). Given that 43% of microbiome reads remained unknown when compared to NCBI’s nr database (Hannigan et al. 2015), the total reads (unknown + known) for Um may share more sequence space with other sites than indicated by known sequences alone. Similarly, viral taxonomic data for Oc (occiput) appear relatively consistent across time points (Fig. 4A), but reference-independent methods indicate that the viral composition varies between time points (Figs 4C, 5A and B). Thus, methods that account for differences in the ‘unknown’ sequence composition are vital in interpreting differences in community sequences from either viromes or microbiomes. Reference-independent statistical approaches such as PHACCS (Angly et al. 2005) and Bayesian network analyses (Hurwitz et al. 2014) may provide an unbiased look at the composition of both viromes and microbiomes.
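To give a sense of how a reference-independent comparison can use all reads, known and unknown alike, the following Python sketch computes a pairwise k-mer sharing matrix between samples. It is a deliberately simplified stand-in and not the Fizkin implementation or the method of Hurwitz et al. (2014): the sharing rule (a read counts as shared if it has at least one k-mer in common with the other sample), the function names and the toy reads are assumptions, and the Bayesian network modelling step is omitted entirely.

```python
def kmers(seq, k):
    """Return the set of k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def sample_kmers(reads, k):
    """Return the set of all k-mers observed in a sample's reads."""
    observed = set()
    for read in reads:
        observed |= kmers(read, k)
    return observed


def shared_read_fraction(reads_a, reads_b, k=20):
    """Fraction of reads in sample A with at least one k-mer also seen in sample B."""
    if not reads_a:
        return 0.0
    index_b = sample_kmers(reads_b, k)
    shared = sum(1 for read in reads_a if kmers(read, k) & index_b)
    return shared / len(reads_a)


def sharing_matrix(samples, k=20):
    """Pairwise (asymmetric) read-sharing fractions between all samples."""
    names = sorted(samples)
    return {a: {b: shared_read_fraction(samples[a], samples[b], k) for b in names}
            for a in names}


if __name__ == "__main__":
    # Toy reads and a short k so that the example runs on tiny sequences.
    samples = {
        "Oc_t1": ["ACGTACGT", "TTTTTTTT"],
        "Oc_t2": ["GGACGTGG"],
    }
    print(sharing_matrix(samples, k=4))
```

The matrix is asymmetric because the fraction is taken relative to the reads of the first sample. No reference database is consulted, which is why sequences without any database match still contribute to the comparison.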

OUTLOOK

Metagenomic surveys of viral communities from diverse environments are expanding our understanding of host-phage interactions, phage biodiversity and ecosystem function. Yet, caveats exist in untangling viral from bacterial signal. As new methods and algorithms are developed for metagenomics, applications in viromics should be continually examined given disparities in the underlying sequence diversity between viral and bacterial communities over both space and time. Further, although new algorithms may provide rapid annotation based on reference genomes, viromes may require a full comparison to all reference proteins to differentiate the ‘known’ from the ‘unknown’. Reference-independent approaches that take into account both the known and unknown components of viromes (and their paired microbiome) will be fundamental to our understanding of host-phage interactions in a variety of environments. Future efforts require the development of big data algorithms to compute large-scale pairwise comparisons between viromes to create an organized resource for investigating viral dark matter. Further, algorithms to rapidly compare complete viromes to large-scale reference databases are needed. Lastly, a resource that provides access to tools and databases specific to viruses that can be continually refined by the community is necessary for documenting and understanding the great viral unknown (Hurwitz 2014).

ACKNOWLEDGEMENTS

We would like to thank the Grice lab and especially Geof Hannigan for diligently making raw and intermediate data from the skin microbiome study described here available to the community for on-going inquiry.

FUNDING

This work was supported by a grant from the Gordon and Betty Moore Foundation (#4491) to BLH.

Conflict of interest. None declared.

REFERENCES

Akhter S, Aziz RK, Edwards RA. PhiSpy: a novel algorithm for finding prophages in bacterial genomes that combines similarity- and composition-based strategies. Nucleic Acids Res 2012;40:e126.
Altschul SF, Madden TL, Schäffer AA et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997;25:3389–402.
Amann RI, Ludwig W, Schleifer KH. Phylogenetic identification and in situ detection of individual microbial cells without cultivation. Microbiol Rev 1995;59:143–69.
Angly F, Rodriguez-Brito B, Bangor D et al. PHACCS, an online tool for estimating the structure and diversity of uncultured viral communities using metagenomic information. BMC Bioinformatics 2005;6:41.
Aziz RK, Bartels D, Best AA et al. The RAST Server: rapid annotations using subsystems technology. BMC Genomics 2008;9:75.
Bankevich A, Nurk S, Antipov D et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol 2012;19:455–77.
Bazinet AL, Cummings MP. A comparative evaluation of sequence classification programs. BMC Bioinformatics 2012;13:92.
Beims H, Wittmann J, Bunk B et al. Paenibacillus larvae-directed bacteriophage HB10c2 and its application in American foulbrood-affected honey bee larvae. Appl Environ Microbiol 2015;81:5411–9.
Boisvert S, Raymond F, Godzaridis E et al. Ray Meta: scalable de novo metagenome assembly and profiling. Genome Biol 2012;13:R122.
Boyd EF, Carpenter MR, Chowdhury N. Mobile effector proteins on phage genomes. Bacteriophage 2012;2:139–48.
Breitbart M. Marine viruses: truth or dare. Ann Rev Mar Sci 2012;4:425–48.
Brum J. Concentrating viruses with centrifugal ultrafiltration devices. 2016, https://www.protocols.io/view/ConcentratingViruses-with-an-Amicon-or-Nanosep-Ce-c54y8v (12 January 2016, date last accessed).
Brum JR, Hurwitz BL, Schofield O et al. Seasonal time bombs: dominant temperate viruses affect Southern Ocean microbial dynamics. ISME J 2016;10:437–49.
Brum JR, Ignacio-Espinoza JC, Roux S et al. Patterns and ecological drivers of ocean viral communities. Science 2015;348, DOI: 10.1126/science.1261498.
Brüssow H, Canchaya C, Hardt W-D. Phages and the evolution of bacterial pathogens: from genomic rearrangements to lysogenic conversion. Microbiol Mol Biol Rev 2004;68:560–602.
Buchfink B, Xie C, Huson DH. Fast and sensitive protein alignment using DIAMOND. Nat Methods 2015;12:59–60.
Clokie MRJ, Shan J, Bailey S et al. Transcription of a ‘photosynthetic’ T4-type phage during infection of a marine cyanobacterium. Environ Microbiol 2006;8:827–35.
Council SNR. Malaspina Expedition. 2010. http://www.expedicionmalaspina.es/.
Crusoe MR, Alameldin HF, Awad S et al. The khmer software package: enabling efficient nucleotide sequence analysis. F1000Research 2015;4:900.



Duhaime MB, Deng L, Poulos BT. Towards quantitative metagenomics of wild viruses and other ultra-low concentration DNA samples: a rigorous assessment and optimization of the linker amplification. Environ Microbiol 2012;14:2526–37.
Duhaime MB, Sullivan MB. Ocean viruses: rigorously evaluating the metagenomic sample-to-sequence pipeline. Virology 2012;434:181–6.
Edgar RC. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 2010;26:2460–1.
Edgar RC, Flyvbjerg H. Error filtering, pair assembly and error correction for next-generation sequencing reads. Bioinformatics 2015;31:3476–82.
Edwards RA, Rohwer F. Viral metagenomics. Nat Rev Microbiol 2005;3:504–10.
Finn RD, Clements J, Eddy SR. HMMER web server: interactive sequence similarity searching. Nucleic Acids Res 2011;39:W29–37.
Flores CO, Valverde S, Weitz JS. Multi-scale structure and geographic drivers of cross-infection within marine bacteria and phages. ISME J 2013;7:520–32.
Franzosa EA, Hsu T, Sirota-Madi A et al. Sequencing and beyond: integrating molecular ‘omics’ for microbial community profiling. Nat Rev Microbiol 2015;13:360–72.
Gerlach W, Jünemann S, Tille F et al. WebCARMA: a web application for the functional and taxonomic classification of unassembled metagenomic reads. BMC Bioinformatics 2009;10:430.
Glass EM, Wilkening J, Wilke A et al. Using the metagenomics RAST server (MG-RAST) for analyzing shotgun metagenomes. Cold Spring Harb Protoc 2010;2010, DOI: 10.1101/pdb.prot5368.
Goff SA, Vaughn M, McKay S et al. The iPlant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 2011;2:34.
Hannigan GD, Meisel JS, Tyldsley AS et al. The human skin double-stranded DNA virome: topographical and temporal diversity, genetic enrichment, and dynamic associations with the host microbiome. MBio 2015;6, DOI: 10.1128/mBio.01578-15.
Hoeijmakers WAM, Bártfai R, Françoijs K-J et al. Linear amplification for deep sequencing. Nat Protoc 2011;6:1026–36.
Hoff KJ. The effect of sequencing errors on metagenomic gene prediction. BMC Genomics 2009;10:520.
Howe AC, Jansson JK, Malfatti SA et al. Tackling soil diversity with the assembly of large, complex metagenomes. Proc Natl Acad Sci USA 2014;111:4904–9.
Hurwitz B. iVirus. 2014, http://ivirus.us (21 January 2016, date last accessed).
Hurwitz BL, Brum JR, Sullivan MB. Depth-stratified functional and taxonomic niche specialization in the ‘core’ and ‘flexible’ Pacific Ocean Virome. ISME J 2015;9:472–84.
Hurwitz BL, Deng L, Poulos BT et al. Evaluation of methods to concentrate and purify ocean virus communities through comparative, replicated metagenomics. Environ Microbiol 2013;15:1428–40.
Hurwitz BL, Hallam SJ, Sullivan MB. Metabolic reprogramming by viruses in the sunlit and dark ocean. Genome Biol 2013;14:R123.
Hurwitz BL, Sullivan MB. The Pacific Ocean Virome (POV): a marine viral metagenomic dataset and associated protein clusters for quantitative viral ecology. PLoS One 2013;8:e57355.
Hurwitz BL, Westveld AH, Brum JR et al. Modeling ecological drivers in marine viral communities using comparative metagenomics and network analyses. Proc Natl Acad Sci USA 2014;111:10714–9.
Huson DH, Mitra S, Ruscheweyh H-J et al. Integrative analysis of environmental sequences using MEGAN4. Genome Res 2011;21:1552–60.
Huson DH, Xie C. A poor man’s BLASTX–high-throughput metagenomic protein database search using PAUDA. Bioinformatics 2014;30:38–9.
Hyatt D, Chen GL, LoCascio PF et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 2010;11:119.
Hynes AP, Mercer RG, Watton DE et al. DNA packaging bias and differential expression of gene transfer agent genes within a population during production and release of the Rhodobacter capsulatus gene transfer agent, RcGTA. Mol Microbiol 2012;85:314–25.
Karsenti E, Acinas SG, Bork P et al. A holistic approach to marine eco-systems biology. PLoS Biol 2011;9:e1001177.
Kelley DR, Liu B, Delcher AL et al. Gene prediction with Glimmer for metagenomic sequences augmented by classification and clustering. Nucleic Acids Res 2012;40:e9.
Lam HYK, Clark MJ, Chen R et al. Performance comparison of whole-genome sequencing platforms. Nat Biotechnol 2012;30:78–82.
Lang AS, Zhaxybayeva O, Beatty JT. Gene transfer agents: phage-like elements of genetic exchange. Nat Rev Microbiol 2012;10:472–82.
Leplae R, Lima-Mendez G, Toussaint A. ACLAME: a CLAssification of mobile genetic elements, update 2010. Nucleic Acids Res 2010;38:D57–61.
Lima-Mendez G, Faust K, Henry N et al. Determinants of community structure in the global plankton interactome. Science 2015;348, DOI: 10.1126/science.1262073.
Lima-Mendez G, Van Helden J, Toussaint A et al. Prophinder: a computational tool for prophage prediction in prokaryotic genomes. Bioinformatics 2008;24:863–5.
Lindell D, Jaffe JD, Johnson ZI et al. Photosynthesis genes in marine viruses yield proteins during host infection. Nature 2005;438:86–9.
Liu B, Gibbons T, Ghodsi M et al. Accurate and fast estimation of taxonomic profiles from metagenomic shotgun sequences. BMC Genomics 2011;12(Suppl 2):54.
Loman NJ, Misra RV, Dallman TJ et al. Performance comparison of benchtop high-throughput sequencing platforms. Nat Biotechnol 2012;30:434–9.
Marçais G, Kingsford C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 2011;27:764–70.
Meinicke P. UProC: tools for ultra-fast protein domain classification. Bioinformatics 2015;31:1382–8.
Minot S, Bryson A, Chehoud C et al. Rapid evolution of the human gut virome. Proc Natl Acad Sci USA 2013;110:12450–5.
Minot S, Sinha R, Chen J et al. The human gut virome: interindividual variation and dynamic response to diet. Genome Res 2011;21:1616–25.
Minot S, Wu GD, Lewis JD et al. Conservation of gene cassettes among diverse viruses of the human gut. PLoS One 2012;7:e42342.
Namiki T, Hachiya T, Tanaka H et al. MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Res 2012;40:e155.
Noguchi H, Park J, Takagi T. MetaGene: prokaryotic gene finding from environmental genome shotgun sequences. Nucleic Acids Res 2006;34:5623–30.




Ondov BD, Treangen TJ, Mallonee AB et al. Fast genome and metagenome distance estimation using MinHash. bioRxiv 2015, DOI: http://dx.doi.org/10.1101/029827.
Ounit R, Wanamaker S, Close TJ et al. CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers. BMC Genomics 2015;16:236.
Paul JH. Prophages in marine bacteria: dangerous molecular time bombs or the key to survival in the seas? ISME J 2008;2:579–89.
Pell J, Hintze A, Canino-Koning R et al. Scaling metagenome sequence assembly with probabilistic de Bruijn graphs. Proc Natl Acad Sci USA 2012;109:13272–7.
Peng Y, Leung HCM, Yiu SM et al. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics 2012;28:1420–8.
Phillippy AM, Schatz MC, Pop M. Genome assembly forensics: finding the elusive mis-assembly. Genome Biol 2008;9:R55.
Poulos B. Cesium chloride dialysis for viruses. 2015a, https://www.protocols.io/view/Cesium-Chloride-Dialysis-for-Virusesc7jzkm (12 April 2016, date last accessed).
Poulos B. DNase I Treatment. 2015b, https://www.protocols.io/view/DNase-I-Treatment-c3myk5 (12 January 2016, date last accessed).
Poulos B. Purification of viruses using sucrose cushion. 2015c, https://www.protocols.io/view/Purifying-Viruses-UsingSucrose-Cushion-c3wypd (12 January 2016, date last accessed).
Poulos B, Duhaime M, Deng L. DNA preparation and linker amplification for pyrosequencing. 2016, https://www.protocols.io/view/DNA-Preparation-and-Linker-Amplification-for-Pyrosc9zz75.
Rho M, Tang H, Ye Y. FragGeneScan: predicting genes in short and error-prone reads. Nucleic Acids Res 2010;38:e191.
Rich V. DNA extraction from sorted cells. 2016, https://www.protocols.io/view/DNA-Extraction-from-sorted-cells-c38yrv (12 January 2016, date last accessed).
Rohwer F. Phage tangential flow filtration. 2016, https://www.protocols.io/view/Phage-Tangential-Flow-Filtrationc7yzpv (12 January 2016, date last accessed).
Rohwer F, Thurber RV. Viruses manipulate the marine environment. Nature 2009;459:207–12.
Rosen G, Garbarine E, Caseiro D et al. Metagenome fragment classification using N-mer frequency profiles. Adv Bioinformatics 2008, DOI: 10.1155/2008/205969.
Rosen GL, Reichenberger ER, Rosenfeld AM. NBC: the Naive Bayes Classification tool webserver for taxonomic classification of metagenomic reads. Bioinformatics 2011;27:127–9.
Roux S, Enault F, Hurwitz BL et al. VirSorter: mining viral signal from microbial genomic data. PeerJ 2015a;3:e985.
Roux S, Hallam SJ, Woyke T et al. Viral dark matter and virus-host interactions resolved from publicly available microbial genomes. Elife 2015b;4, DOI: 10.7554/eLife.08490.
Roux S, Krupovic M, Debroas D et al. Assessment of viral community functional potential from viral metagenomes may be hampered by contamination with cellular sequences. Open Biol 2013;3:130160.

Salter SJ, Cox MJ, Turek EM et al. Reagent and laboratory contamination can critically impact sequence-based microbiome analyses. BMC Biol 2014;12:87.
Solonenko N. 16S PCR Universal. 2016, https://www.protocols.io/view/16S-PCR-Universal-c5my45 (12 January 2016, date last accessed).
Solonenko SA, Sullivan MB. Preparation of metagenomic libraries from naturally occurring marine viruses. Methods Enzymol 2013;531:143–65.
Sullivan MB, Lindell D, Lee JA et al. Prevalence and evolution of core photosystem II genes in marine cyanobacterial viruses and their hosts. PLoS Biol 2006;4:e234.
Ter-Hovhannisyan V, Lomsadze A, Chernoff YO et al. Gene prediction in novel fungal genomes using an ab initio algorithm with unsupervised training. Genome Res 2008;18:1979–90.
Thomas T, Gilbert J, Meyer F. A 123 of Metagenomics. In: Nelson KE (ed.). Encyclopedia of Metagenomics. Genes, Genomes and Metagenomes: Basics, Methods, Databases and Tools. US: Springer, 2015, 1–9.
Thurber RV, Haynes M, Breitbart M et al. Laboratory procedures to generate viral metagenomes. Nat Protoc 2009;4:470–83.
Treangen TJ, Salzberg SL. Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nat Rev Genet 2012;13:36–46.
UniProt Consortium. UniProt: a hub for protein information. Nucleic Acids Res 2015;43:D204–12.
Weitz JS, Poisot T, Meyer JR et al. Phage-bacteria infection networks. Trends Microbiol 2013;21:82–91.
Wetzel J, Kingsford C, Pop M. Assessing the benefits of using mate-pairs to resolve repeats in de novo short-read prokaryotic assemblies. BMC Bioinformatics 2011;12:95.
Wittmann J, Riedel T, Bunk B et al. Complete genome sequence of the novel temperate Clostridium difficile phage phiCDIF1296T. Genome Announc 2015;3, DOI: 10.1128/genomeA.00839-15.
Wood DE, Salzberg SL. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol 2014;15:R46.
Yilmaz S, Allgaier M, Hugenholtz P. Multiple displacement amplification compromises quantitative analysis of metagenomes. Nat Methods 2010;7:943–4.
Yooseph S, Sutton G, Rusch DB et al. The Sorcerer II Global Ocean Sampling expedition: expanding the universe of protein families. PLoS Biol 2007;5:e16.
Youens-Clark K. Protocol for comparative metagenomics and network analyses using the Fizkin App in CyVerse. 2016, https://www.protocols.io/view/Modeling-ecological-driversin-marine-viral-commun-efgbbjw (21 January 2016, date last accessed).
Zhao Y, Tang H, Ye Y. RAPSearch2: a fast and memory-efficient protein similarity search tool for next-generation sequencing data. Bioinformatics 2012;28:125–6.
Zhou Y, Liang Y, Lynch KH et al. PHAST: a fast phage search tool. Nucleic Acids Res 2011;39:W347–52.