Preprocessing

Preprocessing of NGS data is largely uniform across data types: a series of steps ensures data quality and integrity before downstream analyses diverge toward distinct biological insights. The fundamental preprocessing steps are:

  • Quality Control (QC): The first task is to assess the quality of your sequencing data. QC involves checking for issues such as base call accuracy, sequence length distribution, and per-base sequence quality scores. Tools like FastQC are commonly used to generate quality reports.
  • Adapter Trimming: During library preparation, sequencing adapters are added to the ends of DNA fragments. These adapters must be removed from the sequencing reads to prevent them from interfering with downstream analyses. Tools like Trimmomatic or Trim Galore are employed for this purpose.
  • Read Filtering: Some sequencing reads may be of low quality or contain sequencing errors. Filtering criteria are applied to remove reads that do not meet specified quality thresholds. Common filtering criteria include read length and base quality scores.
  • Reference Mapping: For many NGS applications, such as DNA-seq and RNA-seq, reads are aligned or mapped to a reference genome or transcriptome. This process allows for variant calling, gene expression quantification, and other analyses. BWA for DNA-seq and STAR for RNA-seq are common aligners.
  • Duplicate Removal: Sequencing library preparation can generate duplicate reads, which can bias downstream analyses. Duplicate removal ensures that each unique fragment is represented only once.
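The steps above can be sketched as a short sequence of shell commands. This is a minimal example, not a production pipeline: file names, thread counts, adapter file, and quality thresholds are all illustrative placeholders, and the reference must be indexed beforehand (`bwa index ref.fa`).

```shell
#!/bin/bash
set -euo pipefail

# 1. Quality control: one HTML report per FASTQ file
fastqc sample_R1.fastq.gz sample_R2.fastq.gz -o qc/

# 2. Adapter trimming + read filtering (Trimmomatic also drops
#    low-quality and too-short reads in the same pass)
trimmomatic PE -threads 4 \
    sample_R1.fastq.gz sample_R2.fastq.gz \
    trimmed_R1.fastq.gz unpaired_R1.fastq.gz \
    trimmed_R2.fastq.gz unpaired_R2.fastq.gz \
    ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 SLIDINGWINDOW:4:20 MINLEN:36

# 3. Reference mapping with BWA-MEM, piped straight into a coordinate sort
bwa mem -t 4 ref.fa trimmed_R1.fastq.gz trimmed_R2.fastq.gz |
    samtools sort -@ 4 -o sample.sorted.bam -

# 4. Duplicate removal with Picard
picard MarkDuplicates I=sample.sorted.bam O=sample.dedup.bam \
    M=dup_metrics.txt REMOVE_DUPLICATES=true

# Index the final BAM for downstream tools
samtools index sample.dedup.bam
```

For RNA-seq you would swap the `bwa mem` step for a splice-aware aligner such as STAR; the overall shape of the workflow stays the same.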

In a nutshell, preprocessing data involves going from FASTQ to BAM files.

In the fundamental section you learned how to transfer your FASTQ files to the cluster, how to find the software you need on the cluster (e.g. FastQC, BWA, ...), how to load that software, and how to use the command line (or write a script) to run these tools. With this knowledge, you are now well prepared to carry out NGS data preprocessing on the cluster: simply go through the steps above, decide which tool to use at each step, and run the tools on the command line, one step at a time. All the help you need is in the manual for each tool.
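For example, on a cluster that uses Environment Modules, locating and running a tool might look like this (the module names and version numbers are assumptions; check `module avail` on your own cluster):

```shell
# Find out which versions of a tool are installed
module avail fastqc

# Load a specific version (hypothetical version number)
module load fastqc/0.12.1

# Run the tool on your data
fastqc sample_R1.fastq.gz -o qc_reports/

# Load the next tool for the next step (hypothetical version number)
module load bwa/0.7.17
```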

But if you feel that you still need a bit of extra help with this, don't worry, we have also included below a few examples of NGS hands-on tutorials and workflows. These tutorials provide clear step-by-step instructions, so you can use them for further guidance.

Later on, in the Pipelines section, you'll also discover how to use Nextflow pipelines for NGS data analysis, particularly for preprocessing, but sometimes also for downstream analysis. Running a Nextflow pipeline is generally straightforward, but occasional issues may arise, and debugging them is a more advanced topic, which is why we have left it for a later section. Additionally, we believe that building a solid understanding of the individual steps before delving into pipelines will enhance your learning experience.
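As a preview, launching a community pipeline such as nf-core/rnaseq typically takes a single command; it bundles QC, trimming, alignment, and quantification into one reproducible workflow. The input paths, genome, and profile below are placeholders that you would adapt to your own data and cluster setup:

```shell
# Run the nf-core RNA-seq pipeline (paths and profile are illustrative)
nextflow run nf-core/rnaseq \
    --input samplesheet.csv \
    --outdir results/ \
    --genome GRCh38 \
    -profile singularity
```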

TO DO

This "Data Wrangling and Processing for Genomics" course from Data Carpentry is a continuation of the "Introduction to the Command Line for Genomics" course that we included in The Unix Shell section. This particular tutorial focuses on preprocessing data for variant calling.

TAKE THE COURSE

The "RNA-seq lesson" course, developed by Tijs Bliek, Frans van der Kloet and Marc Galland from the Amsterdam Science Park Study Group, is a Carpentries-style lesson on RNA-sequencing data analysis. Section "03 From fastq files to alignments" focuses on the preprocessing steps.

TAKE THE COURSE

Practicals 1 ("QC and quality trimming of raw sequencing reads") and 2 ("Short read alignment with STAR") from the "Functional Genomics" course, developed by the Bioinformatics Core at the CRUK Cambridge Institute, are also hands-on tutorials that focus on preprocessing steps for RNA-seq data.

PRACTICAL 1   PRACTICAL 2