Pipelines

You are here because you want to process NGS data. As you should know by now, this involves a series of preprocessing steps, including quality control (QC) of the FastQ files, read trimming, and alignment... plus other steps that depend on the type of NGS data you have: transcript quantification for RNA-seq, variant calling for WGS, peak calling for ChIP-seq...

As you also should know by now, you can run these steps one at a time on the command line, or, if you took a look at the Bash Scripting section, you can run them sequentially from a script. What we want to introduce now is the concept of pipelines, a more advanced and comprehensive solution than a plain script:

A script is typically a single, standalone program or set of commands designed to perform a specific task or a series of related tasks. Scripts are usually executed linearly, one command or operation running after the other, and they do not inherently support parallel processing without additional coding.
A pipeline, on the other hand, is a more comprehensive and structured framework that manages the execution of multiple scripts or tasks in a coordinated and often parallel manner. Pipelines define the workflow: the order of execution, the data flow, the dependencies between tasks, and the error handling. In addition, pipelines emphasize reproducibility by encapsulating the entire analysis workflow, including software versions and parameters, so that the same analysis can be run consistently across different environments.


Nf-core and Nextflow

There is a wide range of existing bioinformatics pipelines tailored for NGS data analysis, but we particularly want to highlight, and highly recommend, the pipelines developed by nf-core. Nf-core is a collaborative initiative in the field of bioinformatics. It comprises a growing collection of community-driven, production-ready, and standardized bioinformatics pipelines that all share one important feature: they are all built on the Nextflow framework.

So what is Nextflow? Nextflow is a powerful and versatile workflow management system designed to simplify and streamline the execution of complex computational pipelines.
At its core, a Nextflow pipeline consists of a series of interconnected processes, each representing a discrete computational task. These processes can be written in various scripting languages (e.g., Bash, Python, R) and can be executed in a variety of computing environments, including local workstations, high-performance clusters, and cloud platforms. Nextflow excels at managing the dependencies between processes and ensuring parallelism where possible.
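To make this concrete, here is a minimal sketch of what a Nextflow (DSL2) pipeline could look like. The process name, file paths, and glob patterns are illustrative assumptions, not code from any real nf-core pipeline.

```nextflow
// Minimal sketch of a Nextflow DSL2 pipeline (names and paths are hypothetical).
nextflow.enable.dsl = 2

process FASTQC {
    input:
    path reads            // a FastQ file delivered through a channel

    output:
    path "*_fastqc.html"  // the QC report produced by the task

    script:
    """
    fastqc ${reads}
    """
}

workflow {
    // Build a channel from all FastQ files; Nextflow runs one FASTQC
    // task per file, in parallel where resources allow.
    reads_ch = Channel.fromPath("data/*.fastq.gz")
    FASTQC(reads_ch)
}
```

Notice that the `script` block is plain Bash: Nextflow wraps the commands you already know, while the `input`/`output` declarations let it track dependencies and schedule tasks for you.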

----------------------------------------

We think that Nextflow is the best framework for teaching wet-lab scientists how to run pipelines, thanks to its user-friendly scripting language, extensive documentation, and built-in support for managing complexities like data dependencies and parallel processing. So in this section, you will be introduced to Nextflow and will learn how to run existing nf-core pipelines (or even develop your own!).
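As a preview, launching an existing nf-core pipeline typically takes a single command. This is a sketch: the pipeline shown (nf-core/rnaseq) and its parameters are assumptions for illustration, so check the pipeline's page on nf-co.re for the actual options it accepts.

```shell
# Launch the nf-core RNA-seq pipeline, using Docker for the software
# environment. Nextflow fetches the pipeline code from GitHub automatically.
nextflow run nf-core/rnaseq \
    -profile docker \
    --input samplesheet.csv \
    --outdir results \
    --genome GRCh38
```

Single-dash options (`-profile`) go to Nextflow itself, while double-dash options (`--input`, `--outdir`) are parameters of the pipeline being run.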

Nf-core

Nextflow

Using nf-core in Myriad