Pipeline modules

Preprocessing modules

Diagnostics

These modules can be used to assess the quality of the PAR-CLIP library.

FastQCModule

Uses FastQC to create a diagnostic report of various quality aspects of the fastq reads.

Pipeline input: fastq

Pipeline output:

Parameter Default value Description
kmer_length 7 length of the kmers used for detecting overrepresented kmers
outdir_name fastQC name of the output directory that stores the report
extra_flags [] list of extra flags passed to the command line tool

BamPPModule

Postprocesses a bam alignment file and creates statistics of mutation rates and mismatch positions. Can clip mutations from the end of the reads.

Pipeline input: bam

Pipeline output: bam

Parameter Default value Description
remove_n_edge_mut 0 clip mutations that are at the n outermost bases in each alignment
max_mut_per_read 1 discard reads that have more mutations than the given integer
min_base_quality 0 discard reads that have at least one base with a quality lower than the given integer
min_avg_ali_quality 20 discard reads that have an average quality lower than the given integer
min_mismatch_quality 20 discard reads with mismatches of quality lower than the given integer
dump_raw_data False dump raw data for diagnostic purposes
outdir_name bam_analysis name of the output directory storing diagnostics and plots

SoftclipAnalysisModule

If the mapper allows soft-clipping this module shows the most frequently clipped sequences. The output can be used to understand common unclipped contamination. Creates two files 3prime_clipped.txt and 3prime_clipped.txt.

Interpretation of the output:

clipped bases: 2 | clipped sequences: 152

TG                                    128
CG                                      7
CT                                      4
CC                                      3
TT                                      3
TC                                      2
AT                                      1
CA                                      1
GG                                      1
GT                                      1

The header line contains the number of clipped bases and the number of affected alignments. Below the most commonly clipped sequences are enumerated along with the number of affected alignments in descending order.

Pipeline input: bam

Pipeline output:

Parameter Default value Description
outdir_name bam_analysis name of the output directory storing diagnostics and plots

TransitionFrequencyModule

Plots the frequency of transitions in a bar plot.

Pipeline input: mpileup

Pipeline output:

Parameter Default value Description
output_prefix   file name prefix of the output file
min_cov 5 only consider sites with at least this coverage
y_axis_limit 0 limit of the y axis
remove_tmp_files True remove temporary files

Shows the observed substitution rates for all possible base substitutions. The T->C transitions is displayed as P(C|T).

Example output:

_images/transition_freq.png

Duplicate removal

In order to make meaningful biological conclusions, PCR duplicates should be removed after sequencing. We offer two modules here. A naive approach implemented in DuplicateRemovalModule and two modules using umi_tools, applicable when the library contains unique molecular identifiers (UMIs).

DuplicateRemovalModule

A poor man’s approach to duplicate removal: this module removes fastq reads with the same sequence. If the PAR-CLIP library uses UMIs, it is recommended to use the more sophisticated modules based on umi_tools. Please refer to UmiToolsDedupModule and UmiToolsExtractModule.

Pipeline input: fastq

Pipeline output: fastq

UmiToolsExtractModule

Together with UmiToolsDedupModule this is the preferred way to deduplicate PAR-CLIP libraries that use UMIs. This module extracts the UMI and appends it to the read name. As such it is recommended to run this module as one of the first steps in a pipeline. After mapping the UmiToolsDedupModule module can be used to remove duplicated reads.

For further information, please also check the documentation of umi_tools [SHS17].

Parameter Default value Description
extra_flags [] list of extra flags passed to the command line tool

Pipeline input: fastq

Pipeline output: fastq

UmiToolsDedupModule

Together with UmiToolsExtractModule this is the preferred way to deduplicate PAR-CLIP libraries that use UMIs. This module deduplicates bam files based on extracted UMIs. This module has to be run after UmiToolsExtractModule.

For further information, please also check the documentation of umi_tools [SHS17].

Pipeline input: bam

Pipeline output: bam

Parameter Default value Description
extra_flags [] list of extra flags passed to the command line tool

Adapter clipping

Adapter clipping is an important step in PAR-CLIP libraries: in case of very small inserts, the beginning of the 3’ adapter is present in the reads. SkewerAdapterClippingModule is the module of choice for removing these adapters.

ClippyAdapterClippingModule clips adapters with less sensitity, but also detects partial 5’ adapters. It can be useful when dealing with libraries that were generated from very low amounts of RNA.

SkewerAdapterClippingModule

This module uses skewer [JLDZ14] to trim the 3’ adapter from the PAR-CLIP reads. Additional arguments can be directly passed to skewer’s command line call. Please consult skewer’s documentation for a detailed description of all available options.

Pipeline input: fastq

Pipeline output: fastq

Parameter Default value Description
extra_args [] list of extra flags passed to the command line tool

ClippyAdapterClippingModule

This module removes longer traces of adapters by looking for perfect matches of the adapter ends. It also removes random barcodes and adapters. If you are using UmiToolsExtractModule to extract the UMIs don’t forget to set clipped_5prime_bc to True.

Pipeline input: fastq

Pipeline output: fastq

Parameter Default value Description
clip_len 10 minimum base pairs required to be detected as adapter sequence
clipped_5prime_bc False UMIs already removed from the 5’ end

Mapping

With Bowtie and STAR we offer two very different alignment strategies. STAR can map spliced reads and use softclipping to remove contaminants automatically. Soft clipping however prevents from mapping transitions at either end of the alignment. Mappers feed unmapped reads back in the pipeline and thus can be chained.

STARMapModule

STAR [DDS+13] is a general purpose RNA-seq data mapper. Unmapped reads are returned to the pipeline in fastq format. For detailed configuration options, please also refer to STAR’s user manual.

Pipeline input: fastq

Pipeline output: bam, fastq

Parameter Default value Description
genome_index   path to the directory containing the STAR genome index
n_mismatch 1 maximum number of allowed mismatches
n_multimap 1 maximum number of mapping positions. Maps uniquely by default
allow_soft_clipping True enable softclipping
outdir_name star_out name of the output directory
extra_flags [] additional commandline options passed to STAR

BowtieMapModule

Bowtie [Lan10] is a genomic aligner and as such cannot map spliced reads. Unmapped reads are re-queued in the pipeline.

Pipeline input: fastq

Pipeline output: bam, fastq

Parameter Default value Description
genome_index   prefix of bowtie’s genome index
n_mismatch 1 maximum number of allowed mismatches
n_multimap 1 maximum number of mapping positions. Maps uniquely by default
extra_flags [] additional commandline options passed to bowtie

Binding site prediction

We offer two different strategies for predicing binding sites. BSFinderModule does not require mock information and therefore cannot distinguish background binding from factor specific binding events. MockinbirdModule is the recommended binding site predictor. It requires parameters trained on a mock experiment.

BSFinderModule

BSFinder calculates p-values by learning a statistical model on non-specific conversion events. This model cannot distinguish background binding from factor specific binding events. The prediction algorithm was presented in [Tor15].

Pipeline input: mpileup

Pipeline output: table

Parameter Default value Description
pval_threshold 0.005 only sites with a p-value smaller than this are reported
min_cov 2 minimum coverage of reported binding sites

NaiveBSFinderModule

The naive binding site finder predicts all sites that pass a specified minimum threshold of conversion events.

Pipeline input: mpileup

Pipeline output: table

Parameter Default value Description
min_transitions 2 only sites with at least this many conversion events are reported

MockinbirdModule

Mockinbird is the core module that predicts binding sites by harnessing information from a mock experiments. Its input files are generated by the modules in the Mockinbird related modules section.

Pipeline input: trtable, mock_model

Pipeline output: table

Parameter Default value Description
plot_dir mockinbird_plots directory for writing out diagnostic plots
max_k_mock 10 sites with more specific conversions than max_k_mock are discarded
extra_args [] additional arguments directly passed to the called script

Miscellaneous

A collection of miscellaneous helper modules.

SortIndexModule

The SortIndexModule sorts and indexes a bam files using samtools. This is generally required before generating a pileup file.

Pipeline input: bam

Pipeline output: bam

PileupModule

Uses samtools to create a pileup file. Pileup report coverage and transitions per genomic base and are the input of our predictors. Please be aware that the bam file has to be sorted. If in doubt, queue after the SortIndexModule.

Pipeline input: bam

Pipeline output: mpileup

NormalizationModule

The normalization module calculates an occupancy by dividing the number of observed transitions by the coverage of a reference experiment. The appropriate reference experiment should reflect the pool of RNA the factor sees when choosing where to bind. Depending on the binding properties of the protein of interest, an RNA-seq experiment under PAR-CLIP conditions, PAR-CLIP of the RNA polymerase or protocols to capture transient binding such as 4SU-seq may be appropriate.

Additionally, SNPs are removed by detecting elevated conversion rates in the normalization pileup file.

Pipeline input: table

Pipeline output: table

Parameter Default value Description
mut_snp_ratio 0.75 ratio of conversations to coverage in the normalization pileup for a site being detected as SNP

QuantileCapModule

Caps the occupancy value at a given quantile. This module can help removing the influence of outliers on downstream analyses, such as the gene plot.

Pipeline input: table

Pipeline output: table

Parameter Default value Description
max_quantile 0.95 all occupancy values are capped to the value of this quantile

Table2FastaModule

Converts a binding site table file to fasta by extracting the genomic sequence around the binding site.

Pipeline input: table

Pipeline output: fasta

Parameter Default value Description
genome_fasta   path to the genome fasta file

Postprocessing modules

Plots

CenterPlotBSModule

This module plots a bootstrapped metagene plot fixed at the start and end of each annotations.

Pipeline input: table

Pipeline output:

Parameter Default value Description
gff_file   gff file of annotations used for the metagene plot
output_prefix   file name prefix for the output files
labelCenterA   label for the metagene start position
labelCenterB   label for the metagene end position
labelBody   label for the metagene body
downstream_bp 1000 number of base pairs shown downstream of start
upstream_bp 1000 number of base pairs shown upstream of the end
gene_bp 750 number of base pairs inside the annotation, i.e. downstream of the start and upstream of the end
min_trscr_size_bp 1500 filter out all transcript shorter than this size
max_trscr_size_bp 100000 filter out all trascript longer than this size
smoothing_window 20 size of the window used for smoothing the profile (in bp)
remove_tmp_files True clean up temporary files
bootstrap_iter 2500 number of bootstrap iterations
n_processes 4 number of parallel processes

Example plot:

size metagene plot

A plot of the PAR-CLIP occupancy around the start and end sites of genomic annoations. The occupancy profile of the sense strand is depicted in blue, occupancy on the anti-sense strand is shown in green. The profiles are smoothed with a running mean. 95%-confidence intervals are calculated by bootstrap sampling the annotations.

The bottom of the plot shows the occupancy profiles in heatmap representation.

KmerPerPositionModule

Plots the kmer occurence frequencies around binding sites.

Pipeline input: table

Pipeline output:

Parameter Default value Description
genome_fasta   path to genome in fasta format (requires index)
output_prefix   file name prefix for the output files
kmer_k 3 length of kmers
first_index 0 index of the first site plotted
last_index 1500 index of the last site plotted
width 50 number of base pairs around the binding site considered
sort_key occupancy sort column of the table; valid values are occupancy, transitions, coverage and score.
gff_exclude_path ‘’ sites that overlap one of the annotations in this file are dropped
gff_padding 20 annotations are extended by this amount of base pairs around start and end
remove_tmp_files True clean up temporary files

HeatmapPlotModule

Plot a heatmap of all transcripts matching given length criteria as a heat map. Using binning over transcripts and transcript lenth.

Pipeline input: table

Pipeline output:

Parameter Default value Description
gff_file   gff file of annotations used for the metagene plot
output_prefix   file name prefix for the output files
downstream_bp 4000 number of base pairs shown downstream of start
upstream_bp 1000 number of base pairs shown upstream of the end
min_trscr_size_bp 0 filter out all transcript shorter than this size
max_trscr_size_bp 5000 filter out all transcripts longer than this size
x_bins 500 number of bins in x direction (transcript length)
y_bins 500 number of bins in y direction (grouping transcripts)
x_pixels 500 pixel in x directions
y_pixels 500 pixels in y direction
remove_tmp_files True clean up temporary files

HeatmapSmallPlotModule

Plot a heatmap of all transcripts matching given length criteria as a heat map. Using binning over transcripts and transcript lenth. Similar to HeatmapPlotModule, but more consise plot.

Pipeline input: table

Pipeline output:

Parameter Default value Description
gff_file   gff file of annotations used for the metagene plot
output_prefix   file name prefix for the output files
downstream_bp 500 number of base pairs shown downstream of start
upstream_bp 1000 number of base pairs shown upstream of the end
min_trscr_size_bp 0 filter out all transcript shorter than this size
max_trscr_size_bp 5000 filter out all transcripts longer than this size
x_bins 500 number of bins in x direction (transcript length)
y_bins 500 number of bins in y direction (grouping transcripts)
x_pixels 500 pixel in x directions
y_pixels 500 pixels in y direction
remove_tmp_files True clean up temporary files

Motif detection

XXmotifModule

Runs XXmotif, a tool for de-novo detection of overrepresented motifs.

Pipeline input: table

Pipeline output:

Parameter Default value Description
genome_fasta   path to genome in fasta format (requires index)
output_prefix   file name prefix for the output files
negative_set_gff   path to a gff file used for sampling the negative set
n_negative seqs 20000 number of negative sequences sampled
first_index 0 index of the first site plotted
last_index 1500 index of the last site plotted
width 12 number of base pairs around the binding site considered
sort_key occupancy sort column of the table; valid values are occupancy, transitions, coverage and score.
gff_exclude_path ‘’ sites that overlap one of the annotations in this file are dropped
gff_padding 20 annotations are extended by this amount of base pairs around start and end
remove_tmp_files True clean up temporary files

Miscellaneous

GffFilterModule

Filter sites that overlap with a given gff annotation. Can be used to filter sites in highly abundant transcripts such as tRNA or rRNA. Pass the name of the gff features you want to exclude to the features option.

Pipeline input: table

Pipeline output: table

Parameter Default value Description
filter_gff   path to gff file used for filtering
file_postfix fil postfix appended to the table name
padding_bp 20 annotations are extended by this amount of base pairs
features [] list of features that are filtered. Excludes all by default.

Writing your own modules

A straightforward way to implement your own pipeline module is by subclassing CmdPipelineModule. Here we use the SkewerAdapterClippingModule as an example.

from mockinbird.utils import pipeline as pl
class SkewerAdapterClippingModule(pl.CmdPipelineModule):

Adding module arguments

If your module accepts arguments, you have to modify the constructor __init__(self, pipeline):

def __init__(self, pipeline):
    cfg_fmt = [
       ('extra_args', cv.Annot(list, default=[])),
    ]
    super().__init__(pipeline, cfg_req=cfg_fmt)

Each argument is a tuple consisting of a name, here extra_args and an annotation object Annot. Upon construction the annotation takes following keyword arguments:

type:
a callable that represents the base type of the value, here a list
default:
the default value if the option was not provided by the user. The default value None makes setting the value in the configuration file mandatory.
converter:
a callable that validates and converts the value the user set in the config file. Can raise a ValueError, if the user entered an invalid value.

By calling super().__init__() the list of arguments cfg_fmt is passed to the parent constructor.

Following configurations are now valid in an configuration file:

[...]
- SkewerAdapterClippingModule:
    extra_args:
      - -k 12
      - -d 0.01
[...]
- SkewerAdapterClippingModule:
    extra_args: []

which is equivalent to falling back to the default argument

[...]
- SkewerAdapterClippingModule

We define a variety of validators that can readily be used. Please refer to Configuration for details.

Preparing the module

prepare(self, cfg) is the heart of the module and defines the commands that are executed when the module runs. It also registers the outputs and thus makes new files visible to downstream modules.

cfg is a dict of all configuration options set by the user. Configuration of all options requested as described in Adding module arguments can be accessed by their names.

The first action in the prepare method has to be the call to the parent’s prepare method:

def prepare(self, cfg):
   super().prepare(cfg)

Information from previously run modules and global configuration options can be obtained through a reference to Pipeline, which can be accessed through the private attribute self._pipeline.

def prepare(self, cfg):
   super().prepare(cfg)
   pipeline = self._pipeline
   general_cfg = pipeline.get_config('general')
   read_cfg = pipeline.get_config('reads')
   output_dir = general_cfg['output_dir']
   prefix = general_cfg['prefix']

Here we access the global configuration sections general and reads to obtain the path to the output directory and the file name prefix.

Pipeline.get_curfile() is used to obtain the path to files that are queued in the pipeline. The pipeline stores the most recent file path of each format.

def prepare(self, cfg):
    [...]
    fastq_file = pipeline.get_curfile(fmt='fastq')

Here we obtain the path to the most recently queued fastq file.

Now we have all information to construct the command line call for running skewer. The output path is stored in adapter_clipped_file, a file path relative to the output directory passed by the command line script.

def prepare(self, cfg):
    [...]
    adapter_clipped_file = os.path.join(output_dir, prefix + '_skewer.clipped')
    cmd = [
        'skewer',
        fastq_file,
        '-x %s' % general_cfg['adapter3prime'],
        '-m tail',
        '--min %s' % read_cfg['min_len'],
        '--quiet',
        '--stdout',
        '> %r' % adapter_clipped_file,
    ]
    if cfg['extra_args']:
        cmd.extend(cfg['extra_args'])
    self._cmds.append(cmd)

Having collected the command as a list of arguments in cmd, we append all user defined arguments that are conveniently stored in cfg and queue the command by appending it to the private self._cmds variable.

self._intermed_files.append(adapter_clipped_file)
pipeline.upd_curfile(fmt='fastq', filepath=adapter_clipped_file)

As a final step we register adapter_clipped_file as an intermediate file, meaning that it can be deleted provided this module is not the last in the pipeline.

Pipeline.upd_curfile() registers the file as new fastq file and thus makes it visible to following modules.

Note: At time prepare() is called, files to the passed paths may or may not exist. Only queue new commands by appending to self._cmds, never try to execute commands.

The full module

Putting everything together, the code of the SkewerAdapterClippingModule looks like this:

import os
from mockinbird.utils import pipeline as pl
from mockinbird.utils import config_validation as cv

class SkewerAdapterClippingModule(pl.CmdPipelineModule):

    def __init__(self, pipeline):
        cfg_fmt = [
            ('extra_args', cv.Annot(list, default=[])),
        ]
        super().__init__(pipeline, cfg_req=cfg_fmt)

    def prepare(self, cfg):
        super().prepare(cfg)
        pipeline = self._pipeline
        general_cfg = pipeline.get_config('general')
        read_cfg = pipeline.get_config('reads')
        output_dir = general_cfg['output_dir']
        prefix = general_cfg['prefix']
        fastq_file = pipeline.get_curfile(fmt='fastq')

        adapter_clipped_file = os.path.join(output_dir, prefix + '_skewer.clipped')

        cmd = [
            'skewer',
            fastq_file,
            '-x %s' % general_cfg['adapter3prime'],
            '-m tail',
            '--min %s' % read_cfg['min_len'],
            '--quiet',
            '--stdout',
            '> %r' % adapter_clipped_file,
        ]

        if cfg['extra_args']:
            cmd.extend(cfg['extra_args'])
        self._cmds.append(cmd)

        self._intermed_files.append(adapter_clipped_file)
        pipeline.upd_curfile(fmt='fastq', filepath=adapter_clipped_file)

The parent class CmdPipelineModule and the Pipeline take care of all required steps such as config parsing and validation, executing commands and cleaning up files.

Custom modules in config files

Having created your own module, you have to make the class importable in python. The cleanest way to achieve that is to make your module installable by writing a setup.py. For more information please refer to available documentation, e.g. here.

In the config file you have to refer to the module with its full import path, e.g.:

- mypackage.mymodule.SkewerAdapterClippingModule:
  extra_args:
    - -k 2