Introduction to Nextflow

Objectives
  • Learn about the benefits of a workflow manager.
  • Learn Nextflow terminology.
  • Learn the basic commands and options required to run a Nextflow workflow.

1.1.1. What is a Workflow?

A workflow is a sequence of steps or processes through which a piece of work passes [1]. Some steps depend on the completion of earlier ones, while others can run in parallel.

In bioinformatics workflows, each step commonly consists of a computational tool that operates on some input files to produce output files. A workflow can be as simple as several command line tools chained together to perform complex data manipulations.
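As a minimal sketch of such a chained workflow, the following shell script strings a few standard tools together to tally how often each sequence occurs in a FASTA file. The file names (sample.fa, counts.txt) and the toy sequences are made up for illustration and are not part of the workshop data.

```shell
#!/usr/bin/env bash
# Toy workflow: tally how often each sequence occurs in a FASTA file.
# All file names and sequences here are hypothetical examples.
set -eu

# Work in a scratch directory so nothing in the current directory is touched
workdir=$(mktemp -d)
cd "$workdir"

# Step 0: create a small input file so the script is self-contained
printf '>read1\nACGT\n>read2\nGGCC\n>read3\nACGT\n' > sample.fa

# Step 1: drop the header lines, keeping only the sequences
# Step 2: sort so that identical sequences are adjacent
# Step 3: count each distinct sequence, most frequent first
grep -v '^>' sample.fa | sort | uniq -c | sort -k1,1nr > counts.txt

cat counts.txt
```

Each stage reads the output of the previous one, so the order of the steps is fixed by the pipe itself. Nothing here handles logging, resuming, or job submission, which is where the questions below come in.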

A simple script like the example above would likely be adequate to prototype a workflow for a one-off analysis. However, there are many other considerations to take into account when turning it into an efficient and scalable computational workflow. For example:

  • How do we remove unneeded intermediate files?
  • How do we handle tool installations, versions, and dependencies?
  • How do we submit each of these jobs to a job scheduler (e.g., Slurm) on a High Performance Computing (HPC) cluster?
  • How do we resume a partially failed workflow, especially if some of the initial completed tasks take a long time to compute?
  • How do we capture the log outputs from each tool?
  • How do we monitor how far the workflow has progressed?
  • How do we share this workflow?

Some of these considerations are the primary reasons why people choose to use a workflow system to orchestrate and deploy their computational workflows. There are many existing workflow systems, and Nextflow is one of them.

1.1.2. What is Nextflow?

Nextflow is a widely used workflow system for building scalable, portable, and reproducible computational workflows.

It addresses the considerations that we discussed, providing features such as:

  • Scalability, portability, and reproducibility
  • Ease of workflow deployment and sharing
  • Execution abstraction between workflow logic and execution system

Nextflow implements its own domain-specific language (DSL), which extends the Groovy programming language (a language that runs on the Java Virtual Machine).

You will notice that some libraries and functions used in Nextflow workflows are written in native Groovy.

1.1.3. Processes and Channels

In Nextflow, processes and channels are the fundamental building blocks of a workflow.

A process is a unit of execution that represents a single computational step in a workflow. It typically performs one specific task, and specifies its inputs and outputs and any directives required for its execution. Process scripts can be written in any language that can be executed from the command line, such as Bash, Python, or R.

process sayHello {
  input: 
    val x
  output:
    stdout
  script:
    """
    echo '$x world!'
    """
}

In the above example, the process called sayHello takes one input value x, appends world! to it, prints the result, and emits the standard output (stdout) as its output. Bash is used inside the script block here; this is the default script language in Nextflow.

Processes can also be parameterized, which makes their behaviour flexible and enables their reuse within the same workflow and across different workflows. Processes are executed independently as tasks (i.e., they do not share a common writable state) and can run in parallel, allowing for efficient utilization of computing resources. Nextflow automatically manages the data dependencies between processes, ensuring that each process is executed only when its input data is available and all of its dependencies have been satisfied.

A channel is an asynchronous first-in, first-out (FIFO) queue that is used to join processes together.

Channels allow data to be passed between processes, and can be used to manage data, parallelize tasks, and structure workflows. A process can define one or more channels as an input and output. Ultimately, the workflow execution itself is implicitly defined by these declarations.

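workflow {
    // Create a channel that emits four greetings
    Channel.of('Bonjour', 'Ciao', 'Hello', 'Hola')
        .set { ch_input }

    // Each element of ch_input becomes a separate sayHello task
    ch_output = sayHello(ch_input)
}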

In this example, four strings (Bonjour, Ciao, Hello, and Hola) are assigned to a channel ch_input. This channel is then passed as input to the sayHello process, where each element in the channel creates a separate task, and these tasks are executed in parallel. The output of the process is assigned to a channel ch_output.

1.1.4. Execution abstraction

While a process defines what command or script is executed, the executor determines how and where the script is executed.

Nextflow provides an abstraction between the workflow’s functional logic and the underlying execution system. This allows users to define a workflow once and execute it on different computing platforms without having to modify the workflow. Nextflow provides a variety of built-in execution options (e.g., local, HPC cluster, and cloud-based), and allows users to switch between them using command-line arguments.

If not specified, processes are executed on your local computer, which is useful for workflow development and testing purposes. However, for real-world computational workflows, an HPC or cloud-based platform is often required.
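One common way to make this switch is through configuration profiles, selected on the command line with the -profile option. The sketch below writes a small nextflow.config containing two profiles. The cluster profile name, the queue placeholder, and the main.nf script name are assumptions for illustration, and the settings your own cluster needs may differ.

```shell
# Write a minimal nextflow.config into a scratch directory.
# In practice this file would sit in the directory you launch Nextflow from.
# The 'cluster' profile name and the queue value are illustrative placeholders.
config_dir=$(mktemp -d)

cat > "$config_dir/nextflow.config" <<'EOF'
profiles {
    // Used when no profile is given: run tasks on the launching machine
    standard {
        process.executor = 'local'
    }
    // Submit every task as a Slurm job instead
    cluster {
        process.executor = 'slurm'
        process.queue    = '<PARTITION>'
    }
}
EOF

echo "Wrote $config_dir/nextflow.config"

# The same workflow could then be launched either way, for example:
#   nextflow run main.nf                    # tasks run locally
#   nextflow run main.nf -profile cluster   # tasks are submitted through Slurm
```

The workflow script itself stays unchanged in both cases; only the executor setting differs.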

A full list of supported executors, along with how to configure them, is available in the Nextflow documentation.

1.1.5. Nextflow CLI

Nextflow provides a robust command line interface for the management and execution of workflows. Nextflow can be used on any POSIX-compatible system (Linux, macOS, etc.), and requires Bash 3.2 (or later) and Java 11 (or later, up to 18) to be installed.

Nextflow has been pre-installed on our workshop virtual machines. On your institute's HPC, however, the command to load Nextflow may be different.

For example, to load Nextflow on the Peter Mac HPC, run:

module load nextflow/24.10.5

Nextflow is distributed as a self-installing package and does not require any special installation procedure.

  1. Download the executable package using either wget -qO- https://get.nextflow.io | bash or curl -s https://get.nextflow.io | bash
  2. Make the binary executable on your system by running chmod +x nextflow.
  3. Move the nextflow file to a directory accessible by your $PATH variable, e.g., mv nextflow ~/bin/
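Step 3 only helps if the target directory is actually listed in your PATH. The following is a small sketch to check this, assuming ~/bin as the install location; the helper name on_path is made up for this example.

```shell
# Return success if the given directory is listed in $PATH.
# 'on_path' is a hypothetical helper name used only in this sketch.
on_path() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

install_dir="$HOME/bin"

if on_path "$install_dir"; then
    echo "$install_dir is already on PATH"
else
    # Add it for the current shell session only; to make this permanent,
    # add the same line to your shell start-up file (e.g. ~/.bashrc).
    export PATH="$install_dir:$PATH"
    echo "Added $install_dir to PATH for this session"
fi
```

Wrapping PATH in colons before matching avoids false positives from directories that merely share a prefix, such as ~/bin-old.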

1.1.6. Nextflow options and commands

The top-level Nextflow command line interface consists of options and commands.

You can list Nextflow options and commands with the -h option:

nextflow -h
Usage: nextflow [options] COMMAND [arg...]

Options:
  -C
     Use the specified configuration file(s) overriding any defaults
  -D
     Set JVM properties
  -bg
     Execute nextflow in background
  -c, -config
     Add the specified file to configuration set
  -config-ignore-includes
     Disable the parsing of config includes
  -h
     Print this help
  -log
     Set nextflow log file path
  -q, -quiet
     Do not print information messages
  -remote-debug
     Enable JVM interactive remote debugging (experimental)
  -syslog
     Send logs to syslog server (eg. localhost:514)
  -trace
     Enable trace level logging for the specified package name - multiple packages can be provided separating them with a comma e.g. '-trace nextflow,io.seqera'
  -v, -version
     Print the program version

Commands:
  clean         Clean up project cache and work directories
  clone         Clone a project into a folder
  config        Print a project configuration
  console       Launch Nextflow interactive console
  drop          Delete the local copy of a project
  help          Print the usage help for a command
  info          Print project and system runtime information
  inspect       Inspect process settings in a pipeline project
  kuberun       Execute a workflow in a Kubernetes cluster (experimental)
  list          List all downloaded projects
  log           Print executions log and runtime info
  plugin        Execute plugin-specific commands
  pull          Download or update a project
  run           Execute a pipeline project
  secrets       Manage pipeline secrets
  self-update   Update nextflow runtime to the latest available version
  view          View project script file(s)

Options for each Nextflow command can also be viewed by appending the -help option. Note that there is only one dash in the -help option.

nextflow run -help

1.1.7. Managing your environment

You can also use environment variables to control the Nextflow runtime and the underlying Java virtual machine. These variables can be exported before running a workflow and will be interpreted by Nextflow.

Since we are using shared storage, we should consider storing shared software in a location that is accessible by all users in the cluster. The key Nextflow environment variables for this are NXF_APPTAINER_CACHEDIR for Apptainer, NXF_SINGULARITY_CACHEDIR for Singularity, and NXF_CONDA_CACHEDIR for Conda.

For this workshop, we will be using Apptainer as our container tool. As all users have their individual VMs, we will set NXF_APPTAINER_CACHEDIR to ~/apptainer_cache. Note that $HOME is used in the command below because the shell does not expand ~ inside double quotes.

export NXF_APPTAINER_CACHEDIR="$HOME/apptainer_cache"

At Peter Mac HPC, you would set this variable to:

export NXF_APPTAINER_CACHEDIR="/config/binaries/singularity/containers_devel/nextflow"

Images downloaded by workflow executions will now be stored in this directory.
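It can help to make sure the cache directory exists and is writable before launching a workflow. A minimal sketch, assuming the workshop location of apptainer_cache in your home directory as the fallback when the variable is not already set:

```shell
# Use the existing value if set, otherwise fall back to the workshop default.
# "$HOME" is used because "~" is not expanded inside double quotes.
export NXF_APPTAINER_CACHEDIR="${NXF_APPTAINER_CACHEDIR:-$HOME/apptainer_cache}"

# Create the directory if it does not exist yet
mkdir -p "$NXF_APPTAINER_CACHEDIR"

# Confirm that it is writable before relying on it
if [ -w "$NXF_APPTAINER_CACHEDIR" ]; then
    echo "Apptainer images will be cached in $NXF_APPTAINER_CACHEDIR"
else
    echo "Cannot write to $NXF_APPTAINER_CACHEDIR" >&2
fi
```

To keep the setting across sessions, the export line could be added to your shell start-up file.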

1.1.8. Executing a workflow

Nextflow also enables users to run workflows hosted in a remote code repository such as GitHub, by specifying the repository URL or its qualified name. For example, if a Nextflow project bar is hosted in a GitHub repository foo at the address http://github.com/foo/bar, it could be run using either:

nextflow run foo/bar

nextflow run http://github.com/foo/bar

As our first exercise, we will execute the hello workflow directly from the nextflow-io GitHub repository. Start by creating a directory called lesson1.1.

mkdir ./lesson1.1 && cd $_

Exercise: Use the nextflow run command to run the pipeline using the qualified name.

nextflow run nextflow-io/hello
N E X T F L O W  ~  version 23.04.1
Launching `https://github.com/nextflow-io/hello` [deadly_hoover] DSL2 - revision: 1d71f857bb [master]
executor >  local (4)
[a2/9c17c8] process > sayHello (2) [100%] 4 of 4 ✔
Hello world!

Bonjour world!

Hola world!

Ciao world!

When you run a workflow, Nextflow first looks for a local file with the name you have specified. If that file does not exist, it looks for a public repository with the same name on GitHub (unless otherwise specified). If the repository is found, Nextflow automatically pulls the workflow to your global cache and executes it.

Be aware of what is already in the working directory from which you launch your workflow. If it contains other workflows or configuration files, you may encounter unexpected results.

Should I be running Nextflow on a login node?

Short answer: no! Even though Nextflow itself does not perform any heavy computation, without the necessary execution configuration all the processes within the workflow will be executed locally, that is, on the login node itself.

At Peter Mac HPC, we would run a Nextflow job inside an sbatch job, or via an interactive Slurm job:

srun --pty -p <PARTITION> --mem 8GB --mincpus 2 -t 0-5:00 bash
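For the sbatch route, a batch script along the following lines could be used. This is only a sketch: the job name, resource requests, script file name, and the choice of the nextflow-io/hello workflow are assumptions for illustration, and <PARTITION> must be replaced with a partition that exists on your cluster.

```shell
# Write a hypothetical Slurm batch script that launches Nextflow.
# Resource values loosely mirror the srun example and are placeholders.
cat > run_nextflow.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=nextflow_head
#SBATCH --partition=<PARTITION>
#SBATCH --mem=8G
#SBATCH --cpus-per-task=2
#SBATCH --time=0-05:00

# Load Nextflow (module name as used on the Peter Mac HPC)
module load nextflow/24.10.5

# This job only runs the Nextflow head process; where the tasks
# themselves run depends on the configured executor.
nextflow run nextflow-io/hello
EOF

# Check the script for syntax errors without running it
bash -n run_nextflow.sbatch && echo "run_nextflow.sbatch looks syntactically valid"

# Submit it to the scheduler with:
#   sbatch run_nextflow.sbatch
```
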

Key points
  • Nextflow is a workflow system that provides features that enable a workflow to be scalable, portable, and reproducible
  • Environment variables can be used to control your Nextflow runtime (especially to set your shared software location)
  • The command nextflow run can be used to execute a workflow





This workshop is adapted from various Nextflow training materials.

Footnotes

  1. https://www.lexico.com/definition/workflow