Introduction to nf-core

Objectives
  • Learn about the core features of nf-core.
  • Learn the terminology used by nf-core.
  • Use Nextflow to pull and run the nf-core/testpipeline workflow.

This lesson introduces nf-core features and concepts, workflow structure, tooling, and example nf-core pipelines.

1.2.1. What is nf-core?

nf-core is a community effort to collect a curated set of analysis workflows built using Nextflow.

nf-core provides a standardized set of best practices, guidelines, and templates for building and sharing bioinformatics workflows. These workflows are designed to be modular, scalable, and portable, allowing researchers to easily adapt and execute them using their own data and compute resources.

The community is a diverse group of bioinformaticians, developers, and researchers from around the world who collaborate on developing and maintaining a growing collection of high-quality workflows. These workflows cover a range of applications, including transcriptomics, proteomics, and metagenomics.

One of the key benefits of nf-core is that it promotes open development, testing, and peer review, ensuring that the workflows are robust, well-documented, and validated against real-world datasets. This helps to increase the reliability and reproducibility of bioinformatics analyses and ultimately enables researchers to accelerate their scientific discoveries.

nf-core is published in Nature Biotechnology: Ewels, P.A., Peltzer, A., Fillinger, S. et al. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol 38, 276–278 (2020).

Key Features of nf-core workflows

  • Documentation
    • nf-core workflows have extensive documentation covering installation, usage, and description of output files to ensure that you won’t be left in the dark.
  • Stable Releases
    • nf-core workflows use GitHub releases to tag stable versions of the code and software, making workflow runs totally reproducible.
  • Packaged software
    • Pipeline dependencies are automatically downloaded and handled using Docker, Singularity, Conda, or other software management tools. There is no need for any software installations.
  • Portable and reproducible
    • nf-core workflows follow best practices to ensure maximum portability and reproducibility. The large community makes the workflows exceptionally well-tested and easy to execute.
  • Cloud-ready
    • nf-core workflows are tested on AWS.

1.2.2. Executing an nf-core workflow

The nf-core website has a full list of workflows and associated documentation to be explored.

Each workflow has a dedicated page that includes expansive documentation that is split into six sections:

  • Introduction
    • An introduction and overview of the workflow
  • Results
    • Example output files generated from the full test dataset
  • Usage docs
    • Descriptions of how to execute the workflow
  • Parameters
    • Grouped workflow parameters with descriptions
  • Output docs
    • Descriptions and examples of the expected output files
  • Releases & Statistics
    • Workflow version history and statistics

As nf-core is a community development project, the code for a pipeline can change at any time. To ensure that you have locked in a specific version of a pipeline, you can use Nextflow’s built-in functionality to pull a workflow. The Nextflow pull command can download and cache workflows from GitHub repositories:

nextflow pull nf-core/<pipeline>

nextflow run will also automatically pull the workflow if it is not already available locally:

nextflow run nf-core/<pipeline>

Nextflow will pull the default git branch if a workflow version is not specified. This will be the master branch for nf-core workflows with a stable release. nf-core workflows use GitHub releases to tag stable versions of the code and software, so you will always be able to execute a previous version of a workflow once it is released by using the -revision or -r flag.
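For example, a specific released version of a workflow can be pulled with the -r flag (the pipeline and version tag below are illustrative):

nextflow pull nf-core/rnaseq -r 3.12.0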

For this section of the workshop we will be using the nf-core/testpipeline as an example.

As we will be running some bioinformatics tools, we will need to make sure of the following:

  • We are not running on a login node
  • The singularity module is loaded (module load singularity/3.7.3)

Set up an interactive session:
srun --pty -p prod_short --mem 20GB --cpus-per-task 2 -t 0-2:00 /bin/bash

Ensure the required modules are loaded:

module list
Currently Loaded Modulefiles:
  1) java/jdk-17.0.6      2) nextflow/23.04.1     3) squashfs-tools/4.5   4) singularity/3.7.3

We will also create a separate output directory for this section.

cd /scratch/users/<your-username>/nfWorkshop; mkdir ./lesson1.2 && cd $_

The base command we will be using for this section is:

nextflow run nf-core/testpipeline -profile test,singularity --outdir my_results

1.2.3. Workflow structure

nf-core workflows start from a common template and follow the same structure. Although you won’t need to edit code in the workflow project directory, having a basic understanding of the project structure and some core terminology will help you understand how to configure its execution.

Let’s take a look at the code for the nf-core/rnaseq pipeline.

Nextflow DSL2 workflows are built up of subworkflows and modules that are stored as separate .nf files.

Most nf-core workflows consist of a single workflow file (there are a few exceptions). This is the main <workflow>.nf file that is used to bring everything else together. Instead of having one large monolithic script, it is broken up into a combination of subworkflows and modules.

A subworkflow is a group of modules that are used in combination with each other and have a common purpose. Subworkflows improve workflow readability and help with the reuse of modules within a workflow. The nf-core community also shares subworkflows in the nf-core subworkflows GitHub repository. Local subworkflows are workflow-specific subworkflows that are not shared in the nf-core subworkflows repository.

Let’s take a look at the BAM_STATS_SAMTOOLS subworkflow.

This subworkflow comprises the following modules:

  • SAMTOOLS_STATS
  • SAMTOOLS_IDXSTATS
  • SAMTOOLS_FLAGSTAT
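A simplified sketch of how such a subworkflow wires modules together is shown below. This is illustrative, not the exact nf-core code, and the channel names are assumptions:

// Illustrative DSL2 subworkflow; the real nf-core code includes
// these modules from their own .nf files with include statements
workflow BAM_STATS_SAMTOOLS {
    take:
    ch_bam_bai    // channel: [ val(meta), path(bam), path(bai) ]

    main:
    SAMTOOLS_STATS    ( ch_bam_bai )
    SAMTOOLS_IDXSTATS ( ch_bam_bai )
    SAMTOOLS_FLAGSTAT ( ch_bam_bai )

    emit:
    stats    = SAMTOOLS_STATS.out.stats
    idxstats = SAMTOOLS_IDXSTATS.out.idxstats
    flagstat = SAMTOOLS_FLAGSTAT.out.flagstat
}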

A module is a wrapper for a process. Most modules execute a single tool and contain the following definitions:

  • inputs
  • outputs
  • a script block
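As a minimal sketch (illustrative, not the exact nf-core module code; the container tag is an assumption), a module wrapping samtools flagstat might look like this:

// Illustrative module: wraps a single tool in a process
process SAMTOOLS_FLAGSTAT {
    container 'quay.io/biocontainers/samtools:1.17--h00cdaf9_0'  // illustrative container tag

    input:
    tuple val(meta), path(bam)

    output:
    tuple val(meta), path("*.flagstat"), emit: flagstat

    script:
    """
    samtools flagstat $bam > ${meta.id}.flagstat
    """
}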

Like subworkflows, modules can also be shared in the nf-core modules GitHub repository or stored as local modules. All modules from the nf-core repository are version controlled and tested to ensure reproducibility. Local modules are workflow-specific modules that are not shared in the nf-core modules repository.

1.2.4. Viewing parameters

Every nf-core workflow has a full list of parameters on the nf-core website. When viewing these parameters online, you will also be shown a description and the type of the parameter. Some parameters will have additional text to help you understand when and how a parameter should be used.

Parameters and their descriptions can also be viewed in the command line using the run command with the --help parameter:

nextflow run nf-core/<workflow> --help
Challenge

View the parameters for the nf-core/testpipeline workflow using the command line:

The nf-core/testpipeline workflow parameters can be printed using the run command and the --help option:

nextflow run nf-core/testpipeline --help

1.2.5. Parameters in the command line

Parameters can be customized using the command line. Any parameter can be configured on the command line by prefixing the parameter name with a double dash (--):

nextflow run nf-core/<workflow> --<parameter>
Tip

Nextflow options are prefixed with a single dash (-) and workflow parameters are prefixed with a double dash (--).

Depending on the parameter type, you may be required to add additional information after your parameter flag. For example, for a string parameter, you would add the string after the parameter flag:

nextflow run nf-core/<workflow> --<parameter> string
Challenge

Give the MultiQC report for the nf-core/testpipeline workflow the name of your favorite animal by setting the multiqc_title parameter with a command line flag:

Add the --multiqc_title flag to your command and execute it. Use the -resume option to save time:

nextflow run nf-core/testpipeline -profile test,singularity --multiqc_title koala --outdir my_results -resume

In this example, you can check your parameter has been applied by listing the files created in the results folder (my_results):

ls my_results/multiqc/

1.2.6. Configuration files

Configuration files are .config files that can contain various workflow properties. Custom configuration files can be passed to a workflow at execution using the -c option:

nextflow run nf-core/<workflow> -profile test,docker -c <path/to/custom.config>

Multiple custom .config files can be included at execution by separating them with a comma (,).
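For example, using the comma syntax described above, two hypothetical custom configuration files could be supplied together:

nextflow run nf-core/<workflow> -profile test,docker -c my_custom.config,my_other.config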

Custom configuration files follow the same structure as the configuration file included in the workflow directory. Configuration properties are organized into scopes by grouping the properties in the same scope using the curly brackets notation. For example:

alpha {
     x = 1
     y = 'string value..'
}

Scopes allow you to quickly configure settings required to deploy a workflow on different infrastructure using different software management. For example, the executor scope can be used to provide settings for the deployment of a workflow on an HPC cluster. Similarly, the singularity scope controls how Singularity containers are executed by Nextflow. Multiple scopes can be included in the same .config file using a mix of dot prefixes and curly brackets, as shown below. A full list of scopes is described in detail in the Nextflow configuration documentation.
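For example, a minimal sketch mixing the two notations (executor.queueSize and the singularity settings are real Nextflow configuration properties, but the values shown are illustrative):

// Curly bracket notation for the executor scope
executor {
    queueSize = 10
}

// Dot prefix notation for the singularity scope
singularity.enabled    = true
singularity.autoMounts = true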

Challenge

Give the MultiQC report for the nf-core/testpipeline workflow the name of your favorite color using the multiqc_title parameter in a custom my_custom.config file:

Create a custom my_custom.config file that contains your favorite color, e.g., blue:

params {
    multiqc_title = "blue"
}

Include the custom .config file in your execution command with the -c option:

nextflow run nf-core/testpipeline --outdir my_results -profile test,singularity -resume -c my_custom.config

Check that it has been applied:

ls my_results/multiqc/

Why did this fail?

You cannot use the params scope in custom configuration files passed with -c. Parameters can only be configured using the -params-file option or the command line. Although multiqc_title is listed as a parameter in the STDOUT, it was not applied to the executed command.

We will revisit this at the end of the module.

1.2.7. Parameter files

Parameter files are used to define params options for a pipeline and are generally written in YAML format. They are added to a pipeline run with the -params-file flag:
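nextflow run nf-core/<workflow> -params-file <params.yaml>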

Example YAML:

"<parameter1_name>": 1,
"<parameter2_name>": "<string>",
"<parameter3_name>": true
Challenge

Based on the failed application of the multiqc_title parameter, create a my_params.yml file setting multiqc_title to your favorite color. Then re-run the pipeline with your my_params.yml:

Set up my_params.yml:

multiqc_title: "black"

Then execute the pipeline with the parameter file:

nextflow run nf-core/testpipeline -profile test,singularity -params-file my_params.yml --outdir Lesson1_2

1.2.8. Default configuration files

All parameters will have a default setting that is defined using the nextflow.config file in the workflow project directory. By default, most parameters are set to null or false and are only activated by a profile or configuration file.

There are also several includeConfig statements in the nextflow.config file that are used to load additional .config files from the conf/ folder. Each additional .config file contains categorized configuration information for your workflow execution, some of which can be optionally included (see the sketch after this list):

  • base.config
    • Included by the workflow by default.
    • Generous resource allocations using labels.
    • Does not specify any method for software management and expects software to be available (or specified elsewhere).
  • igenomes.config
    • Included by the workflow by default.
    • Default configuration to access reference files stored on AWS iGenomes.
  • modules.config
    • Included by the workflow by default.
    • Module-specific configuration options (both mandatory and optional).
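As an illustration, the relevant lines of a workflow’s nextflow.config typically look something like this (a simplified sketch; the exact contents vary between workflows):

// Load additional configuration files from the conf/ folder
includeConfig 'conf/base.config'
includeConfig 'conf/igenomes.config'
includeConfig 'conf/modules.config'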

Notably, configuration files can also contain the definition of one or more profiles. A profile is a set of configuration attributes that can be activated when launching a workflow by using the -profile command option:

nextflow run nf-core/<workflow> -profile <profile>
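As a minimal sketch (the settings are illustrative and not taken from a specific workflow), profiles are defined in a configuration file like this:

profiles {
    // Software management profile: enable Singularity containers
    singularity {
        singularity.enabled    = true
        singularity.autoMounts = true
    }
    // Software management profile: enable Docker containers
    docker {
        docker.enabled = true
    }
}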

Profiles used by nf-core workflows include:

  • Software management profiles
    • Profiles for the management of software using software management tools, e.g., docker, singularity, and conda.
  • Test profiles
    • Profiles to execute the workflow with a standardized set of test data and parameters, e.g., test and test_full.

Multiple profiles can be specified in a comma-separated (,) list when you execute your command. The order of profiles is important as they will be read from left to right:

nextflow run nf-core/<workflow> -profile test,singularity

nf-core workflows are required to define software containers and conda environments that can be activated using profiles.

Tip

If your computer has internet access and one of Conda, Singularity, or Docker installed, you should be able to run any nf-core workflow with the test profile and the respective software management profile ‘out of the box’. The test profile will pull small test files directly from the nf-core/test-datasets GitHub repository and run them on your local system. The test profile is an important control to check that the workflow is working as expected and is a great way to trial a workflow. Some workflows have multiple test profiles for you to try.

Key points
  • nf-core is a community effort to collect a curated set of analysis workflows built using Nextflow.
  • Nextflow can be used to pull nf-core workflows.
  • nf-core workflows follow similar structures.
  • nf-core workflows are configured using parameters and profiles.

These materials are adapted from Customising Nf-Core Workshop by Sydney Informatics Hub