Best practice, tips and tricks
3.3.1. Running Nextflow Pipelines on an HPC
By default, Nextflow launches multiple parallel tasks that can run concurrently. Recall that previously we ran these tasks locally. We can, however, use the process and executor scopes to run these tasks through an HPC job scheduler such as SLURM, submitting the desired number of concurrent jobs.
process {
    executor = 'slurm'
    queue = 'PARTITION'
}

executor {
    queueSize = 4
}
By specifying the executor as slurm, Nextflow will submit each process as a separate job using the sbatch command. All jobs will be submitted to the PARTITION partition.
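For example, assuming the configuration above is saved as a file called custom.config, it can be applied when launching a pipeline with the -c option (the pipeline name and profile below are placeholders for whatever you are running):

# Launch a pipeline with the custom scheduler configuration
# (pipeline name and profile are placeholders for this example)
nextflow run nf-core/rnaseq -profile singularity -c custom.config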
Inside the process { } scope, we can also define resources such as cpus, time, memory, and queue.
process {
    executor = 'slurm'
    queue = 'PARTITION'
    cpus = 1
    time = '2h'
    memory = '4.GB'
}

executor {
    queueSize = 4
}
Now each individual job will be executed using 1 CPU, 4 GB of memory, and a maximum time limit of 2 hours. Since we didn’t specify a process label or a process name, these settings will apply to all processes within the pipeline.
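If you want to check exactly what Nextflow requests from SLURM for a given task, one approach (a sketch; the hashed work directory below is a placeholder taken from your own run) is to inspect the .command.run script that Nextflow writes into each task's work directory, which contains the generated scheduler directives:

# Show the SLURM directives generated for one task
# (replace the hashed path with a real task directory from your run)
grep 'SBATCH' work/ab/123456.../.command.run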
Run processes on different partitions
Previously, we used the withLabel and withName process selectors to specify the cpus, time, and memory for a group of processes, or a particular process. We can also use these process selectors to change which partition a job is submitted to.
For example, suppose we have one process that requires the use of GPUs. If we change the queue to our GPU partition gpu_partition, all process jobs, even those that don’t require a GPU, will be run on that partition.
process {
    executor = 'slurm'
    queue = 'gpu_partition'
    cpus = 1
    time = '2h'
    memory = '4.GB'
}

executor {
    queueSize = 4
}
Instead, we can use the withName process selector to send the job execution for that process to a GPU-specific partition. This means we won’t unnecessarily use GPU partition resources for the other processes.
process {
    executor = 'slurm'
    queue = 'PARTITION'
    cpus = 1
    time = '2h'
    memory = '4.GB'

    withName: 'GPU_PROCESS' {
        queue = 'gpu_queue'
    }
}

executor {
    queueSize = 4
}
Specify infrastructure-specific directives for your jobs
Adjusting the custom configuration file above, we can pass any native scheduler options using the clusterOptions process directive, which is used to specify settings that are not already exposed as Nextflow directives.
For example, if you are running your pipeline on an HPC system where usage is billed, you can specify which project the resource usage is charged to.
Suppose you typically submit a job using the following command, where --account is used to specify the project number that resource usage is billed to:

sbatch --account=PROJECT1 script.sh
By default, account is not a supported process directive in Nextflow, so we cannot use the following config:
process {
    executor = 'slurm'
    queue = 'PARTITION'
    account = 'PROJECT1'
    cpus = 1
    time = '2h'
    memory = '4.GB'

    withName: 'GPU_PROCESS' {
        queue = 'gpu_queue'
    }
}

executor {
    queueSize = 4
}
Instead, this can be specified using clusterOptions, as below:
process {
    executor = 'slurm'
    queue = 'PARTITION'
    clusterOptions = "--account=PROJECT1"
    cpus = 1
    time = '2h'
    memory = '4.GB'

    withName: 'GPU_PROCESS' {
        queue = 'gpu_queue'
    }
}

executor {
    queueSize = 4
}
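Because clusterOptions is an ordinary process directive, it can also be set inside a process selector so that scheduler-specific flags only apply to the processes that need them. The sketch below assumes the hypothetical GPU_PROCESS from above and a cluster where GPUs are requested with SLURM's --gres flag:

process {
    executor = 'slurm'
    queue = 'PARTITION'
    clusterOptions = "--account=PROJECT1"

    // Request one GPU only for the process that needs it.
    // Note: clusterOptions set here replaces the process-level value
    // for this process, so the account flag is repeated.
    withName: 'GPU_PROCESS' {
        queue = 'gpu_queue'
        clusterOptions = "--account=PROJECT1 --gres=gpu:1"
    }
}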
On certain HPC systems, you may not be able to submit new jobs from another job (such as an interactive session). In this case, you may get the following error:
sbatch: error: Batch job submission failed: Access/permission denied
To overcome this, run your workflow from a login node (after exiting your interactive session).
3.3.2. Clean your work directory
Your work directory can get very big, very quickly (especially if you are using full-sized datasets). It is good practice to clean your work directory regularly. Rather than removing the work folder with all of its contents, the Nextflow clean command allows you to selectively remove data associated with specific runs.
nextflow clean -help
Clean up project cache and work directories
Usage: clean [options]
  Options:
    -after
      Clean up runs executed after the specified one
    -before
      Clean up runs executed before the specified one
    -but
      Clean up all runs except the specified one
    -n, -dry-run
      Print names of file to be removed without deleting them
      Default: false
    -f, -force
      Force clean command
      Default: false
    -h, -help
      Print the command usage
      Default: false
    -k, -keep-logs
      Removes only temporary files but retains execution log entries and
      metadata
      Default: false
    -q, -quiet
      Do not print names of files removed
      Default: false
The -after, -before, and -but options are all very useful for selecting specific runs to clean. The -dry-run option is also very useful for seeing which files would be removed if you were to -force the clean command.
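For example, one cautious pattern (a sketch; golden_majorana is a placeholder run name taken from the output of nextflow log) is to preview the deletion with a dry run before forcing it:

# List previous runs and their names
nextflow log

# Preview which files would be removed for runs executed before the named run
nextflow clean -before golden_majorana -n

# Remove them once the preview looks right
nextflow clean -before golden_majorana -f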
Exercise: Use Nextflow to clean your work directory of all staged files, but keep your execution logs.
Use the Nextflow clean command with the -k and -f options:
nextflow clean -k -f
3.3.3. Change the default Nextflow cache strategy
Sometimes, a workflow execution is not resumed as expected. By default, Nextflow cache keys index the metadata of the input files. Relaxing the cache stringency to lenient means the file cache keys are based only on file size and path, which can help avoid unexpectedly re-running certain processes when -resume is in use.
To apply the lenient cache strategy to all of your runs, you could add the following to a custom configuration file:
process {
    cache = 'lenient'
}
Again, you can specify different cache strategies for different processes by using withName or withLabel.
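For instance, a minimal sketch that keeps the default cache behaviour globally but relaxes it for a single process (the process name DEDUPLICATE is hypothetical):

process {
    // Only this process uses lenient cache keys;
    // the process name is a placeholder for this example.
    withName: 'DEDUPLICATE' {
        cache = 'lenient'
    }
}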
3.3.4. Access private GitHub repositories
To interact with private repositories on GitHub, you can provide Nextflow with access to GitHub by specifying your GitHub username and a Personal Access Token in the scm configuration file inside your .nextflow/ directory:
providers {
    github {
        user = 'rlupat'
        password = 'my-personal-access-token'
    }
}
Replace 'my-personal-access-token' with your personal access token.
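With the scm file in place, a private pipeline can then be pulled and run by its repository name just like a public one. A sketch, assuming a hypothetical private repository my-org/private-pipeline:

# Pull (or update) the private pipeline using the credentials in ~/.nextflow/scm
nextflow pull my-org/private-pipeline

# Run a specific revision of the private pipeline
nextflow run my-org/private-pipeline -r main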
3.3.5. Additional resources
Here are some useful resources to help you get started with running and developing nf-core pipelines: