Pipeline modules¶

Preprocessing modules¶

Diagnostics¶

These modules can be used to assess the quality of the PAR-CLIP library.

FastQCModule¶

Uses FastQC to create a diagnostic report of various quality aspects of the fastq reads.

Pipeline input: fastq

Pipeline output:

Parameter	Default value	Description
kmer_length	7	length of the kmers used for detecting overrepresented kmers
outdir_name	fastQC	name of the output directory that stores the report
extra_flags	[]	list of extra flags passed to the command line tool

BamPPModule¶

Postprocesses a bam alignment file and creates statistics of mutation rates and mismatch positions. Can clip mutations from the end of the reads.

Pipeline input: bam

Pipeline output: bam

Parameter	Default value	Description
remove_n_edge_mut	0	clip mutations that are at the n outermost bases in each alignment
max_mut_per_read	1	discard reads that have more mutations than the given integer
min_base_quality	0	discard reads that have at least one base with a quality lower than the given integer
min_avg_ali_quality	20	discard reads that have an average quality lower than the given integer
min_mismatch_quality	20	discard reads with mismatches of quality lower than the given integer
dump_raw_data	False	dump raw data for diagnostic purposes
outdir_name	bam_analysis	name of the output directory storing diagnostics and plots

SoftclipAnalysisModule¶

If the mapper allows soft-clipping this module shows the most frequently clipped sequences. The output can be used to understand common unclipped contamination. Creates two files 3prime_clipped.txt and 3prime_clipped.txt.

Interpretation of the output:

clipped bases: 2 | clipped sequences: 152

TG                                    128
CG                                      7
CT                                      4
CC                                      3
TT                                      3
TC                                      2
AT                                      1
CA                                      1
GG                                      1
GT                                      1

The header line contains the number of clipped bases and the number of affected alignments. Below the most commonly clipped sequences are enumerated along with the number of affected alignments in descending order.

Pipeline input: bam

Pipeline output:

Parameter	Default value	Description
outdir_name	bam_analysis	name of the output directory storing diagnostics and plots

TransitionFrequencyModule¶

Plots the frequency of transitions in a bar plot.

Pipeline input: mpileup

Pipeline output:

Parameter	Default value	Description
output_prefix		file name prefix of the output file
min_cov	5	only consider sites with at least this coverage
y_axis_limit	0	limit of the y axis
remove_tmp_files	True	remove temporary files

Shows the observed substitution rates for all possible base substitutions. The T->C transitions is displayed as P(C|T).

Example output:

Duplicate removal¶

In order to make meaningful biological conclusions, PCR duplicates should be removed after sequencing. We offer two modules here. A naive approach implemented in DuplicateRemovalModule and two modules using umi_tools, applicable when the library contains unique molecular identifiers (UMIs).

DuplicateRemovalModule¶

A poor man’s approach to duplicate removal: this module removes fastq reads with the same sequence. If the PAR-CLIP library uses UMIs, it is recommended to use the more sophisticated modules based on umi_tools. Please refer to UmiToolsDedupModule and UmiToolsExtractModule.

Pipeline input: fastq

Pipeline output: fastq

UmiToolsExtractModule¶

Together with UmiToolsDedupModule this is the preferred way to deduplicate PAR-CLIP libraries that use UMIs. This module extracts the UMI and appends it to the read name. As such it is recommended to run this module as one of the first steps in a pipeline. After mapping the UmiToolsDedupModule module can be used to remove duplicated reads.

For further information, please also check the documentation of umi_tools [SHS17].

Parameter	Default value	Description
extra_flags	[]	list of extra flags passed to the command line tool

Pipeline input: fastq

Pipeline output: fastq

UmiToolsDedupModule¶

Together with UmiToolsExtractModule this is the preferred way to deduplicate PAR-CLIP libraries that use UMIs. This module deduplicates bam files based on extracted UMIs. This module has to be run after UmiToolsExtractModule.

For further information, please also check the documentation of umi_tools [SHS17].

Pipeline input: bam

Pipeline output: bam

Parameter	Default value	Description
extra_flags	[]	list of extra flags passed to the command line tool

Adapter clipping¶

Adapter clipping is an important step in PAR-CLIP libraries: in case of very small inserts, the beginning of the 3’ adapter is present in the reads. SkewerAdapterClippingModule is the module of choice for removing these adapters.

ClippyAdapterClippingModule clips adapters with less sensitity, but also detects partial 5’ adapters. It can be useful when dealing with libraries that were generated from very low amounts of RNA.

SkewerAdapterClippingModule¶

This module uses skewer [JLDZ14] to trim the 3’ adapter from the PAR-CLIP reads. Additional arguments can be directly passed to skewer’s command line call. Please consult skewer’s documentation for a detailed description of all available options.

Pipeline input: fastq

Pipeline output: fastq

Parameter	Default value	Description
extra_args	[]	list of extra flags passed to the command line tool

ClippyAdapterClippingModule¶

This module removes longer traces of adapters by looking for perfect matches of the adapter ends. It also removes random barcodes and adapters. If you are using UmiToolsExtractModule to extract the UMIs don’t forget to set clipped_5prime_bc to True.

Pipeline input: fastq

Pipeline output: fastq

Parameter	Default value	Description
clip_len	10	minimum base pairs required to be detected as adapter sequence
clipped_5prime_bc	False	UMIs already removed from the 5’ end

Mapping¶

With Bowtie and STAR we offer two very different alignment strategies. STAR can map spliced reads and use softclipping to remove contaminants automatically. Soft clipping however prevents from mapping transitions at either end of the alignment. Mappers feed unmapped reads back in the pipeline and thus can be chained.

STARMapModule¶

STAR [DDS+13] is a general purpose RNA-seq data mapper. Unmapped reads are returned to the pipeline in fastq format. For detailed configuration options, please also refer to STAR’s user manual.

Pipeline input: fastq

Pipeline output: bam, fastq

Parameter	Default value	Description
genome_index		path to the directory containing the STAR genome index
n_mismatch	1	maximum number of allowed mismatches
n_multimap	1	maximum number of mapping positions. Maps uniquely by default
allow_soft_clipping	True	enable softclipping
outdir_name	star_out	name of the output directory
extra_flags	[]	additional commandline options passed to STAR

BowtieMapModule¶

Bowtie [Lan10] is a genomic aligner and as such cannot map spliced reads. Unmapped reads are re-queued in the pipeline.

Pipeline input: fastq

Pipeline output: bam, fastq

Parameter	Default value	Description
genome_index		prefix of bowtie’s genome index
n_mismatch	1	maximum number of allowed mismatches
n_multimap	1	maximum number of mapping positions. Maps uniquely by default
extra_flags	[]	additional commandline options passed to bowtie

Binding site prediction¶

We offer two different strategies for predicing binding sites. BSFinderModule does not require mock information and therefore cannot distinguish background binding from factor specific binding events. MockinbirdModule is the recommended binding site predictor. It requires parameters trained on a mock experiment.

BSFinderModule¶

BSFinder calculates p-values by learning a statistical model on non-specific conversion events. This model cannot distinguish background binding from factor specific binding events. The prediction algorithm was presented in [Tor15].

Pipeline input: mpileup

Pipeline output: table

Parameter	Default value	Description
pval_threshold	0.005	only sites with a p-value smaller than this are reported
min_cov	2	minimum coverage of reported binding sites

NaiveBSFinderModule¶

The naive binding site finder predicts all sites that pass a specified minimum threshold of conversion events.

Pipeline input: mpileup

Pipeline output: table

Parameter	Default value	Description
min_transitions	2	only sites with at least this many conversion events are reported

MockinbirdModule¶

Mockinbird is the core module that predicts binding sites by harnessing information from a mock experiments. Its input files are generated by the modules in the Mockinbird related modules section.

Pipeline input: trtable, mock_model

Pipeline output: table

Parameter	Default value	Description
plot_dir	mockinbird_plots	directory for writing out diagnostic plots
max_k_mock	10	sites with more specific conversions than `max_k_mock` are discarded
extra_args	[]	additional arguments directly passed to the called script

Miscellaneous¶

A collection of miscellaneous helper modules.

SortIndexModule¶

The SortIndexModule sorts and indexes a bam files using samtools. This is generally required before generating a pileup file.

Pipeline input: bam

Pipeline output: bam

PileupModule¶

Uses samtools to create a pileup file. Pileup report coverage and transitions per genomic base and are the input of our predictors. Please be aware that the bam file has to be sorted. If in doubt, queue after the SortIndexModule.

Pipeline input: bam

Pipeline output: mpileup

NormalizationModule¶

The normalization module calculates an occupancy by dividing the number of observed transitions by the coverage of a reference experiment. The appropriate reference experiment should reflect the pool of RNA the factor sees when choosing where to bind. Depending on the binding properties of the protein of interest, an RNA-seq experiment under PAR-CLIP conditions, PAR-CLIP of the RNA polymerase or protocols to capture transient binding such as 4SU-seq may be appropriate.

Additionally, SNPs are removed by detecting elevated conversion rates in the normalization pileup file.

Pipeline input: table

Pipeline output: table

Parameter	Default value	Description
mut_snp_ratio	0.75	ratio of conversations to coverage in the normalization pileup for a site being detected as SNP

QuantileCapModule¶

Caps the occupancy value at a given quantile. This module can help removing the influence of outliers on downstream analyses, such as the gene plot.

Pipeline input: table

Pipeline output: table

Parameter	Default value	Description
max_quantile	0.95	all occupancy values are capped to the value of this quantile

Table2FastaModule¶

Converts a binding site table file to fasta by extracting the genomic sequence around the binding site.

Pipeline input: table

Pipeline output: fasta

Parameter	Default value	Description
genome_fasta		path to the genome fasta file

Mockinbird related modules¶

MockinbirdModule requires as input a joint dataset of factor of interest and the mock measurement and parameters estimated on the mock. The following modules can be used to create the required input data.

PredictionSitesModule¶

This module creates a file that contains all sites that are considered in the prediction of binding sites. By default these are all genomic sites that have the transition nucleotide on either strand. This can be restricted by giving gff files of genomic regions of interest.

Pipeline input:

Pipeline output: sites

Parameter	Default value	Description
sites_file		path to sites file. Will be created if does not exist yet. Will not be recreated if already existing.
fasta_file		path to genomic fasta file
gff_file	‘’	gff file for restricting predictions to specific regions
transition_nucleotide	T	nucleotide that converts in the PAR-CLIP experiment

MockTableModule¶

Converts the pileup file from the mock experiment to a mock table. Required by the TransitionTableModule.

Pipeline input:

Pipeline output: mocktable

Parameter	Default value	Description
mock_pileup		path to the mock pileup file
mock_table		path to mock table. Will be created if does not exist yet. Will not be recreated if already existing.

TransitionTableModule¶

Combines mock table and factor pileup file to the so called transition table. Depends on the outputs of MockTableModule, PredictionSitesModule.

Pipeline input: sites, mock_table, mpileup

Pipeline output: trtable

LearnMockModule¶

Learns the model parameters from the transition table. Requires inputs from TransitionTableModule and BamStatisticsModule.

Pipeline input: trtable

Pipeline output: mock_model

Parameter	Default value	Description
mock_model		path to the mock model pickle file. Will be created if does not exist. Will not be recreated if already existing
mock_statistics		path to the mock bam statistics
n_mixture_components	5	number of mixture components for fitting the geometric mixture models
em_iterations	250	number of iterations of the EM algorithm fitting the geometric mixture model

BamStatisticsModule¶

This module stores additional information from a bam file in a json file. If the predicted sites are restrained by a gff file in PredictionSitesModule, the same gff file should be used for generating the statistics.

Pipeline input: bam

Pipeline output: stat_file

Parameter	Default value	Description
gff_file	‘’	gff file for restricting the prediction sites

Postprocessing modules¶

Plots¶

CenterPlotBSModule¶

This module plots a bootstrapped metagene plot fixed at the start and end of each annotations.

Pipeline input: table

Pipeline output:

Parameter	Default value	Description
gff_file		gff file of annotations used for the metagene plot
output_prefix		file name prefix for the output files
labelCenterA		label for the metagene start position
labelCenterB		label for the metagene end position
labelBody		label for the metagene body
downstream_bp	1000	number of base pairs shown downstream of start
upstream_bp	1000	number of base pairs shown upstream of the end
gene_bp	750	number of base pairs inside the annotation, i.e. downstream of the start and upstream of the end
min_trscr_size_bp	1500	filter out all transcript shorter than this size
max_trscr_size_bp	100000	filter out all trascript longer than this size
smoothing_window	20	size of the window used for smoothing the profile (in bp)
remove_tmp_files	True	clean up temporary files
bootstrap_iter	2500	number of bootstrap iterations
n_processes	4	number of parallel processes

Example plot:

A plot of the PAR-CLIP occupancy around the start and end sites of genomic annoations. The occupancy profile of the sense strand is depicted in blue, occupancy on the anti-sense strand is shown in green. The profiles are smoothed with a running mean. 95%-confidence intervals are calculated by bootstrap sampling the annotations.

The bottom of the plot shows the occupancy profiles in heatmap representation.

KmerPerPositionModule¶

Plots the kmer occurence frequencies around binding sites.

Pipeline input: table

Pipeline output:

Parameter	Default value	Description
genome_fasta		path to genome in fasta format (requires index)
output_prefix		file name prefix for the output files
kmer_k	3	length of kmers
first_index	0	index of the first site plotted
last_index	1500	index of the last site plotted
width	50	number of base pairs around the binding site considered
sort_key	occupancy	sort column of the table; valid values are occupancy, transitions, coverage and score.
gff_exclude_path	‘’	sites that overlap one of the annotations in this file are dropped
gff_padding	20	annotations are extended by this amount of base pairs around start and end
remove_tmp_files	True	clean up temporary files

HeatmapPlotModule¶

Plot a heatmap of all transcripts matching given length criteria as a heat map. Using binning over transcripts and transcript lenth.

Pipeline input: table

Pipeline output:

Parameter	Default value	Description
gff_file		gff file of annotations used for the metagene plot
output_prefix		file name prefix for the output files
downstream_bp	4000	number of base pairs shown downstream of start
upstream_bp	1000	number of base pairs shown upstream of the end
min_trscr_size_bp	0	filter out all transcript shorter than this size
max_trscr_size_bp	5000	filter out all transcripts longer than this size
x_bins	500	number of bins in x direction (transcript length)
y_bins	500	number of bins in y direction (grouping transcripts)
x_pixels	500	pixel in x directions
y_pixels	500	pixels in y direction
remove_tmp_files	True	clean up temporary files

HeatmapSmallPlotModule¶

Plot a heatmap of all transcripts matching given length criteria as a heat map. Using binning over transcripts and transcript lenth. Similar to HeatmapPlotModule, but more consise plot.

Pipeline input: table

Pipeline output:

Parameter	Default value	Description
gff_file		gff file of annotations used for the metagene plot
output_prefix		file name prefix for the output files
downstream_bp	500	number of base pairs shown downstream of start
upstream_bp	1000	number of base pairs shown upstream of the end
min_trscr_size_bp	0	filter out all transcript shorter than this size
max_trscr_size_bp	5000	filter out all transcripts longer than this size
x_bins	500	number of bins in x direction (transcript length)
y_bins	500	number of bins in y direction (grouping transcripts)
x_pixels	500	pixel in x directions
y_pixels	500	pixels in y direction
remove_tmp_files	True	clean up temporary files

Motif detection¶

XXmotifModule¶

Runs XXmotif, a tool for de-novo detection of overrepresented motifs.

Pipeline input: table

Pipeline output:

Parameter	Default value	Description
genome_fasta		path to genome in fasta format (requires index)
output_prefix		file name prefix for the output files
negative_set_gff		path to a gff file used for sampling the negative set
n_negative seqs	20000	number of negative sequences sampled
first_index	0	index of the first site plotted
last_index	1500	index of the last site plotted
width	12	number of base pairs around the binding site considered
sort_key	occupancy	sort column of the table; valid values are occupancy, transitions, coverage and score.
gff_exclude_path	‘’	sites that overlap one of the annotations in this file are dropped
gff_padding	20	annotations are extended by this amount of base pairs around start and end
remove_tmp_files	True	clean up temporary files

Miscellaneous¶

GffFilterModule¶

Filter sites that overlap with a given gff annotation. Can be used to filter sites in highly abundant transcripts such as tRNA or rRNA. Pass the name of the gff features you want to exclude to the features option.

Pipeline input: table

Pipeline output: table

Parameter	Default value	Description
filter_gff		path to gff file used for filtering
file_postfix	fil	postfix appended to the table name
padding_bp	20	annotations are extended by this amount of base pairs
features	[]	list of features that are filtered. Excludes all by default.

Writing your own modules¶

A straightforward way to implement your own pipeline module is by subclassing CmdPipelineModule. Here we use the SkewerAdapterClippingModule as an example.

from mockinbird.utils import pipeline as pl
class SkewerAdapterClippingModule(pl.CmdPipelineModule):

Adding module arguments¶

If your module accepts arguments, you have to modify the constructor __init__(self, pipeline):

def __init__(self, pipeline):
    cfg_fmt = [
       ('extra_args', cv.Annot(list, default=[])),
    ]
    super().__init__(pipeline, cfg_req=cfg_fmt)

Each argument is a tuple consisting of a name, here extra_args and an annotation object Annot. Upon construction the annotation takes following keyword arguments:

type:: a callable that represents the base type of the value, here a list
default:: the default value if the option was not provided by the user. The default value None makes setting the value in the configuration file mandatory.
converter:: a callable that validates and converts the value the user set in the config file. Can raise a ValueError, if the user entered an invalid value.

By calling super().__init__() the list of arguments cfg_fmt is passed to the parent constructor.

Following configurations are now valid in an configuration file:

[...]
- SkewerAdapterClippingModule:
    extra_args:
      - -k 12
      - -d 0.01

[...]
- SkewerAdapterClippingModule:
    extra_args: []

which is equivalent to falling back to the default argument

[...]
- SkewerAdapterClippingModule

We define a variety of validators that can readily be used. Please refer to Configuration for details.

Preparing the module¶

prepare(self, cfg) is the heart of the module and defines the commands that are executed when the module runs. It also registers the outputs and thus makes new files visible to downstream modules.

cfg is a dict of all configuration options set by the user. Configuration of all options requested as described in Adding module arguments can be accessed by their names.

The first action in the prepare method has to be the call to the parent’s prepare method:

def prepare(self, cfg):
   super().prepare(cfg)

Information from previously run modules and global configuration options can be obtained through a reference to Pipeline, which can be accessed through the private attribute self._pipeline.

def prepare(self, cfg):
   super().prepare(cfg)
   pipeline = self._pipeline
   general_cfg = pipeline.get_config('general')
   read_cfg = pipeline.get_config('reads')
   output_dir = general_cfg['output_dir']
   prefix = general_cfg['prefix']

Here we access the global configuration sections general and reads to obtain the path to the output directory and the file name prefix.

Pipeline.get_curfile() is used to obtain the path to files that are queued in the pipeline. The pipeline stores the most recent file path of each format.

def prepare(self, cfg):
    [...]
    fastq_file = pipeline.get_curfile(fmt='fastq')

Here we obtain the path to the most recently queued fastq file.

Now we have all information to construct the command line call for running skewer. The output path is stored in adapter_clipped_file, a file path relative to the output directory passed by the command line script.

def prepare(self, cfg):
    [...]
    adapter_clipped_file = os.path.join(output_dir, prefix + '_skewer.clipped')
    cmd = [
        'skewer',
        fastq_file,
        '-x %s' % general_cfg['adapter3prime'],
        '-m tail',
        '--min %s' % read_cfg['min_len'],
        '--quiet',
        '--stdout',
        '> %r' % adapter_clipped_file,
    ]
    if cfg['extra_args']:
        cmd.extend(cfg['extra_args'])
    self._cmds.append(cmd)

Having collected the command as a list of arguments in cmd, we append all user defined arguments that are conveniently stored in cfg and queue the command by appending it to the private self._cmds variable.

self._intermed_files.append(adapter_clipped_file)
pipeline.upd_curfile(fmt='fastq', filepath=adapter_clipped_file)

As a final step we register adapter_clipped_file as an intermediate file, meaning that it can be deleted provided this module is not the last in the pipeline.

Pipeline.upd_curfile() registers the file as new fastq file and thus makes it visible to following modules.

Note: At time prepare() is called, files to the passed paths may or may not exist. Only queue new commands by appending to self._cmds, never try to execute commands.

The full module¶

Putting everything together, the code of the SkewerAdapterClippingModule looks like this:

import os
from mockinbird.utils import pipeline as pl
from mockinbird.utils import config_validation as cv

class SkewerAdapterClippingModule(pl.CmdPipelineModule):

    def __init__(self, pipeline):
        cfg_fmt = [
            ('extra_args', cv.Annot(list, default=[])),
        ]
        super().__init__(pipeline, cfg_req=cfg_fmt)

    def prepare(self, cfg):
        super().prepare(cfg)
        pipeline = self._pipeline
        general_cfg = pipeline.get_config('general')
        read_cfg = pipeline.get_config('reads')
        output_dir = general_cfg['output_dir']
        prefix = general_cfg['prefix']
        fastq_file = pipeline.get_curfile(fmt='fastq')

        adapter_clipped_file = os.path.join(output_dir, prefix + '_skewer.clipped')

        cmd = [
            'skewer',
            fastq_file,
            '-x %s' % general_cfg['adapter3prime'],
            '-m tail',
            '--min %s' % read_cfg['min_len'],
            '--quiet',
            '--stdout',
            '> %r' % adapter_clipped_file,
        ]

        if cfg['extra_args']:
            cmd.extend(cfg['extra_args'])
        self._cmds.append(cmd)

        self._intermed_files.append(adapter_clipped_file)
        pipeline.upd_curfile(fmt='fastq', filepath=adapter_clipped_file)

The parent class CmdPipelineModule and the Pipeline take care of all required steps such as config parsing and validation, executing commands and cleaning up files.

Custom modules in config files¶

Having created your own module, you have to make the class importable in python. The cleanest way to achieve that is to make your module installable by writing a setup.py. For more information please refer to available documentation, e.g. here.

In the config file you have to refer to the module with its full import path, e.g.:

- mypackage.mymodule.SkewerAdapterClippingModule:
  extra_args:
    - -k 2