EAGER User Manual

Alexander Peltzer August 24, 2015

1 Concept of the pipeline

The pipeline consists of two main components: the graphical user interface, called 'eager', and the command-line execution part of EAGER, referred to as 'eagercli'. The methods and tools used in the pipeline are shown in Figure 1. An introduction and more detailed explanations of these methods and tools can be found in the publication and in the citations of the individual tools. A short description of what users can expect from each EAGER module is given in Section 2.

[Figure 1 contents: FastQ Preprocessing (FastQC, Clip&Merge), output FastQ; Read Mapping (BWA, BWA-mem, …, Samtools, DeDup, MapDamage, QualiMap, Preseq, CircularMapper), output SAM/BAM; Genotyping (UnifiedGenotyper, VariantFiltration, VCF2Genome; GATK: Preprocessing, HaplotypeCaller), output VCF and FastA.]

Figure 1: The general outline of the EAGER pipeline, containing all tools used in the pipeline.

2 EAGER’s main graphical interface

The main window, from which an EAGER run is configured, is shown in Figure 2.

Figure 2: The main GUI window of EAGER

It is used to configure a new EAGER run. In the top part, the input and output have to be specified. Next, the number of cores and the amount of memory available to the tools can be set. After that, the tools of the pipeline can be selected and configured.

2.1 RAW Data Input Selection

Clicking on “Selection Input *.fq/*.fq.gz Files” opens a file prompt. There it is possible to select either specific fastq or fastq.gz files (see Figure 3a) or a folder with subfolders, where each subfolder contains these fastq files (see Figure 3b).


(a) File select of two individual fastq files
(b) File select of a folder with subfolders
(c) Window for defining the input type

Figure 3: Selecting the input.

After the input files have been selected, a new window opens in which the type of input data has to be defined (see Figure 3c). Here the organism (Human/Bacterial), the age of the dataset (Ancient/Modern), the UDG treatment (UDG Treated/non-UDG Treated) and the sequencing layout (Paired Data or Single-End Data) can be chosen. If the data are capture data, the capture method has to be specified as well (390K, 1240K). Merging can be skipped if the data are already merged, and it is possible to specify whether the data have to be merged lane-wise. Finally, if the data contain sequencing reads from the mitochondrion, this can be indicated here, together with the name of the mitochondrial sequence in the reference. The Circular Mapper needs this name so that it can be applied to this input.


2.2 Reference Data Selection

When clicking on the “Select Reference” button, another file-chooser opens where the reference (in fasta format) has to be selected.

2.3 Output Folder Selection

When clicking on the “Select output folder” button, another file-chooser opens where the output folder has to be selected. This should be an empty folder as new folders are created by EAGER.

2.4 FastQC

It is possible to select or deselect FastQC. There are no extra parameters needed for this program.

2.5 Clip&Merge

Several parameters can be passed to Clip and Merge. They can be entered by clicking on the “Advanced” button next to it. The window for configuring Clip and Merge is shown in Figure 4.

Figure 4: The GUI window of Clip and Merge


The forward and reverse adapters can be specified, along with other advanced parameters that are passed to the command line (see the Clip and Merge specifications for details). In addition, a minimum base quality can be entered; all bases with a lower quality are trimmed. The minimum sequence length of a read after clipping, trimming and merging can also be specified; all reads shorter than this are discarded. Finally, the data do not have to be merged: it is possible to perform only the adapter clipping. If Clip and Merge is disabled, a separate QualityFiltering step can be used instead. It is disabled by default and is only available while Clip and Merge is disabled. It requires only the minimum base quality, the minimum read length and, optionally, advanced parameters.
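The effect of the minimum base quality and the minimum sequence length can be illustrated with a minimal sketch. It assumes that low-quality bases are trimmed from the 3' end only; the manual does not state how Clip and Merge trims internally, and the function name and default thresholds below are hypothetical.

```python
def trim_and_filter(seq, quals, min_base_quality=20, min_length=25):
    """Trim low-quality bases from the 3' end, then apply the length cutoff.

    Illustrative only: the thresholds are hypothetical defaults and the
    3'-end-only trimming is an assumption, not Clip and Merge's documented
    behaviour. Returns (seq, quals) after trimming, or None if discarded.
    """
    end = len(seq)
    # Walk back from the 3' end while the base quality is below the threshold.
    while end > 0 and quals[end - 1] < min_base_quality:
        end -= 1
    if end < min_length:
        return None  # read is too short after trimming and is discarded
    return seq[:end], quals[:end]
```

A read whose last bases fall below the quality threshold is shortened first; only then is the length cutoff applied, so a read can be discarded purely because of its trimmed tail.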

2.6 Mapping

Here the different mapping algorithms can be chosen. Currently there are five different mapping algorithms implemented in EAGER: BWA, BWA-MEM, Circular Mapper, Bowtie2 and Stampy. Some of them have additional parameters that can be set like the seed length (BWA, BWA-MEM, Circular Mapper) or the Elongation Factor (Circular Mapper).

2.7 Complexity Estimation

This module relies on the Preseq method to calculate library complexity, which is useful for screening purposes. It produces an estimate of how complex a sequencing library is, allowing users to decide whether sequencing the library more deeply would be cost-efficient or would only yield a large proportion of duplicate reads with no real information gain.

2.8 Duplicate Removal

Sequencing libraries are often enriched before sequencing. As a result, the same fragment may be sequenced multiple times, which can distort, for example, the calculated coverage. The DeDup method therefore removes these duplicates. However, it is optimized for merged data. For new paired-end data with positive insert sizes, the tool MarkDuplicates should be used instead.
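Why merged data suit this kind of duplicate removal can be shown with a small sketch. For a merged read both ends of the original fragment are known, so one plausible duplicate criterion is identical start, end and strand. This criterion, the `Read` record and the rule of keeping the read with the highest mean quality are assumptions made for illustration, not a description of DeDup's implementation.

```python
from collections import namedtuple

# Hypothetical minimal read record; real reads would come from a BAM file.
Read = namedtuple("Read", "name start end reverse mean_quality")


def remove_duplicates(reads):
    """Keep one read per (start, end, strand) group.

    Assumption: reads sharing both mapping coordinates and strand stem from
    the same fragment; the read with the highest mean quality is kept.
    """
    best = {}
    for read in reads:
        key = (read.start, read.end, read.reverse)
        kept = best.get(key)
        if kept is None or read.mean_quality > kept.mean_quality:
            best[key] = read
    # Return the survivors in coordinate order.
    return sorted(best.values(), key=lambda r: (r.start, r.end, r.reverse))
```

Two reads that start at the same position but end at different positions are treated as distinct fragments here, which is only possible when the fragment end is actually known.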

2.9 Coverage and Statistics Calculation

Turning this feature on will produce several coverage plots and histograms, combined with statistics about e.g. endogenous DNA content, the cluster factor, coverages on specific regions as well as several other statistics for the created BAM output files.
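Two of these statistics can be written down as simple ratios. The definitions below are assumptions, since the manual does not define them: endogenous DNA content is taken as the percentage of all reads that map to the reference, and the cluster factor as the ratio of mapped reads before and after duplicate removal. The function names are hypothetical.

```python
def endogenous_dna_percent(mapped_reads, total_reads):
    """Percentage of all reads that mapped to the reference (assumed definition)."""
    if total_reads == 0:
        return 0.0
    return 100.0 * mapped_reads / total_reads


def cluster_factor(mapped_reads, mapped_reads_after_dedup):
    """Mapped reads before duplicate removal divided by mapped reads after it
    (assumed definition). A value close to 1 means few duplicates."""
    if mapped_reads_after_dedup == 0:
        return float("nan")
    return mapped_reads / mapped_reads_after_dedup
```

Under these definitions, a library with 1000 reads of which 250 map has an endogenous DNA content of 25%, and 300 mapped reads that collapse to 200 after duplicate removal give a cluster factor of 1.5.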

2.10 MapDamage Calculation

Running this module of EAGER uses mapDamage 2.0 to authenticate the DNA sample. It runs several plotting libraries to produce damage profiles and accompanying statistics that describe how much damage the DNA sample has accumulated.

2.11 GATK SNP Calling

This module performs genotyping on the dataset, using either the UnifiedGenotyper or the HaplotypeCaller module provided by the Genome Analysis Toolkit (GATK). Under Advanced, users can set several parameters:

• SNP Reference: You can select, for example, a dbSNP VCF file here in order to get rs IDs assigned to your called variants.
• Ploidy of Organism: Set the correct ploidy of the organism you want to analyze.
• Standard Call Confidence: Set the standard call confidence you would like the genotyper to use.
• Standard Emit Confidence: Set the standard emit confidence you would like the genotyper to use.
• Downsampling: Set the downsampling value you would like to use.
• Advanced Parameters: You may enter any additional option that does not require actual input data here. The values entered are passed on to the genotyper.
• Emit All Sites?: If you select this, the genotyper will produce a call for every site available in the reference genome. Tick this if you would like to create a consensus sequence or run a draft genome reconstruction. Note that this can result in very large VCF files and therefore increase runtime dramatically.
• Emit Conf Sites?: Emit confident sites in addition to variant sites.

For a more complete description of what these values mean, see http://bit.ly/1HIrOel or http://bit.ly/1Dc4mXy.


2.12 GATK Variant Filtration

When selected, this module applies a soft filter to the raw VCF file. Two filters are currently available: the first filters out calls made at positions with low quality, and the second filters out calls below a certain coverage. Both filters can be adjusted in the advanced settings menu of the variant filtration.
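A soft filter keeps every record in the VCF file and only annotates the FILTER column of the calls that fail. The sketch below illustrates this with plain dictionaries; the filter names `LowQual` and `LowCov`, the thresholds and the record layout are hypothetical and do not reflect the values EAGER passes to GATK.

```python
def soft_filter(records, min_qual=30.0, min_depth=5):
    """Annotate records instead of removing them.

    Each record is a dict with 'QUAL' and 'DP' keys (hypothetical layout).
    Failing records get the names of the failed filters, joined with ';'
    as in the VCF FILTER column; all others are marked 'PASS'.
    """
    annotated = []
    for rec in records:
        failed = []
        if rec["QUAL"] < min_qual:
            failed.append("LowQual")  # hypothetical filter name
        if rec["DP"] < min_depth:
            failed.append("LowCov")  # hypothetical filter name
        out = dict(rec)  # copy, so the raw records stay untouched
        out["FILTER"] = ";".join(failed) if failed else "PASS"
        annotated.append(out)
    return annotated
```

Because no record is dropped, downstream tools can still decide for themselves whether to use or ignore the flagged calls.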

2.13 VCF2Genome

The VCF2Genome tool creates a draft genome from a provided VCF file. Note that this requires the input VCF file to contain a call for every position of the reference genome, which the pipeline takes care of automatically. The advanced menu allows several filtering criteria to be configured for the resulting draft genome file:

• Minimal Genotyping Quality: Positions that fall below this minimal genotyping quality are filtered out.
• Minimum Coverage: The minimum coverage required for a call to be trusted by the method.
• Minimum SNP allele frequency: With the default settings, at least 90% of all reads must show a certain SNP for it to be taken into account for the draft genome. Variable positions below this value are not carried over into the final draft genome.
• Draft genome header name: By default this is set automatically to the file name (usually the sample name), but you may set it to something else.
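How the three thresholds could interact for a single position is sketched below. Only the 90% default for the SNP allele frequency comes from this manual. The quality and coverage defaults, the function name and, in particular, the choice to write 'N' for positions that fail a threshold are assumptions for illustration; VCF2Genome may handle such positions differently.

```python
def draft_base(ref, alt, qual, coverage, alt_fraction,
               min_qual=30.0, min_cov=5, min_snp_freq=0.9):
    """Decide the draft genome base for one position.

    ref/alt: reference and alternative allele (alt is None for non-variant
    sites); alt_fraction: fraction of reads supporting alt.
    Assumption: untrusted positions are written as 'N'.
    """
    if qual < min_qual or coverage < min_cov:
        return "N"  # call not trusted
    if alt is None:
        return ref  # confident reference call
    if alt_fraction >= min_snp_freq:
        return alt  # SNP supported by enough reads
    return "N"  # variable position below the allele frequency threshold


def draft_sequence(calls):
    """Concatenate per-position decisions; calls are argument tuples for draft_base."""
    return "".join(draft_base(*call) for call in calls)
```

This also shows why the input VCF needs a call for every reference position: a position without a record would leave a gap in the resulting sequence.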

2.14 CleanUp

The CleanUp procedure deletes intermediate results of the pipeline that are redundant. For example, mapping results are usually produced in SAM format, converted to BAM and subsequently sorted. If this option is turned on, the SAM files are deleted, as they are an uncompressed copy of the converted BAM files. Files that are only created once are always kept!
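The rule "delete a SAM file only if its BAM copy exists" can be sketched as follows. The sketch assumes that the BAM file has the same name as the SAM file apart from the extension; EAGER's actual file naming is not described here, and the function name and `dry_run` parameter are hypothetical.

```python
from pathlib import Path


def remove_redundant_sam(folder, dry_run=True):
    """Delete *.sam files that have a non-empty *.bam file with the same stem.

    Assumption: the converted BAM sits next to the SAM and shares its stem.
    With dry_run=True nothing is deleted; the candidates are only returned.
    """
    removed = []
    # sorted() materialises the listing before any file is deleted.
    for sam in sorted(Path(folder).rglob("*.sam")):
        bam = sam.with_suffix(".bam")
        if bam.exists() and bam.stat().st_size > 0:
            removed.append(sam)
            if not dry_run:
                sam.unlink()
    return removed
```

A SAM file without a BAM counterpart is the only copy of its data and is therefore left in place, in line with the rule that files created only once are always kept.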

2.15 Create Report

Finally, it is possible to specify whether a report of this EAGER run should be generated. It will be created in the chosen output folder and contains several statistics, explained in Section ??

3 EAGER’s command line interface

The EAGER command line interface can be called, depending on your installation, by issuing “eagercli”. It is used to process previously configured analysis runs automatically. There are several ways to run such configuration files:

• Single Mode: You may execute a single XML configuration file by issuing: “eagercli input.xml”
• Multi-Sample Mode: You may execute several XML configuration files by issuing: “eagercli folder-to-samples/”. The command line interface then searches the given folder, determines where appropriate XML files are located and runs these XML files automatically for you.

Independent of the execution mode, the pipeline reports on the command line which file and sample the current thread is working on. It also reports how many samples have been found, allowing you to stop the pipeline if, for example, things have been configured incorrectly. If you need to stop the pipeline, you can start it again with the same command, or on individual samples, without having to compute all results again. Whenever the pipeline finds intermediate results created previously, it does not start from scratch, which speeds up the analysis after, e.g., a power failure or workstation crash.
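To preview which configuration files a multi-sample run might pick up before starting it, a folder can be scanned with a few lines of Python. This is a sketch under the assumption that every file ending in .xml counts as an "appropriate" configuration; eagercli may apply stricter checks. The function names are hypothetical; only the command forms “eagercli input.xml” and “eagercli folder-to-samples/” come from this manual.

```python
from pathlib import Path


def find_configs(target):
    """Return the XML configuration files for a single file or a folder.

    Assumption: any *.xml file below the folder is a run configuration.
    """
    path = Path(target)
    if path.is_file():
        return [path]  # single mode: one configuration file
    return sorted(path.rglob("*.xml"))  # multi-sample mode: search the folder


def build_commands(target, executable="eagercli"):
    """Build one 'eagercli <config>' command (as an argument list) per file."""
    return [[executable, str(cfg)] for cfg in find_configs(target)]
```

Each returned argument list could be handed to `subprocess.run` to process the samples one at a time, for example to rerun only the samples that were interrupted.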