Best practice, tips and tricks

2.3.1. Running Nextflow Pipelines on an HPC

Nextflow, by default, spawns parallel task executions wherever it is running. You can use Nextflow’s executors feature to run these tasks on an HPC job scheduler such as SLURM or PBS Pro. Use a custom configuration file to send all processes to the job scheduler as separate jobs, and define essential resource requests such as cpus, time, memory, and queue inside a process {} scope.

Run all workflow tasks as separate jobs on HPC

In this custom configuration file we send all tasks that a workflow is running to a SLURM job scheduler, specifying that jobs run on the prod_short queue, each with a maximum walltime of 2 hours, 1 CPU and 4 GB of memory:

process {
  executor = 'slurm'
  queue = 'prod_short'
  cpus = 1
  time = '2h'
  memory = '4.GB'
}
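
To apply this configuration, supply the file to your run with Nextflow’s -c option (the file name custom.config below is just a placeholder for wherever you saved it):

nextflow run nf-core/rnaseq -r 3.11.1 \
    -profile singularity \
    -c custom.config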

Run processes with different resource profiles as HPC jobs

Adjusting the custom configuration file above, we can use the withName process selector to specify process-specific resource requirements:

process {
  executor = 'slurm'
    
  withName: processONE {
    queue = 'prod_short'
    cpus = 1
    time = '2h'
    memory = '4.GB'
  }

  withName: processTWO {
    queue = 'prod_med'
    cpus = 2
    time = '10h'
    memory = '50.GB'
  }
}

Specify infrastructure-specific directives for your jobs

Adjusting the custom configuration file above, we can pass any native scheduler settings using the clusterOptions directive. We can use this to specify non-standard resources, for example the HPC project code that all process jobs should be billed to (see the sketch below).
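
A minimal sketch, assuming a SLURM-style --account flag; the project code shown is a placeholder and the exact flag will differ between schedulers and sites:

process {
  executor = 'slurm'
  queue = 'prod_short'
  cpus = 1
  time = '2h'
  memory = '4.GB'
  // Placeholder account/project code to bill jobs against (site-specific)
  clusterOptions = '--account=your_project_code'
}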

You can also set up a configuration tailored to Peter Mac’s HPC partition setup:

executor {
    queueSize         = 100       // maximum number of jobs Nextflow will have queued or running at once
    queueStatInterval = '1 min'   // how often to fetch the queue status from the scheduler
    pollInterval      = '1 min'   // how often to check whether submitted jobs have completed
    submitRateLimit   = '20 min'  // throttle the rate at which jobs are submitted to the scheduler
}

process {
    executor = 'slurm'                          // run every task as a Slurm job
    cache    = 'lenient'                        // cache on file size and path (see 2.3.4)
    beforeScript = 'module load singularity'    // load Singularity before each task runs
    stageInMode = 'symlink'                     // stage inputs as symlinks rather than copies
    // choose a partition based on the task's requested walltime
    queue = { task.time < 2.h ? 'prod_short' : task.time < 24.h ? 'prod_med' : 'prod' }
}
Challenge

Run the previous nf-core/rnaseq workflow using the process and executor scopes above (in a config file), sending each task to SLURM.

Create a nextflow.config file

process.executor = 'slurm'

Run the nf-core/rnaseq workflow again:

nextflow run nf-core/rnaseq -r 3.11.1 \
    -params-file workshop-params.yaml \
    -profile singularity \
    --max_memory '6.GB' \
    --max_cpus 2 \
    -resume 

Did you get the following error?

sbatch: error: Batch job submission failed: Access/permission denied

Try running the same workflow on the login node and observe the difference:

>>> squeue -u rlupat -i 5

          17429286      prod nf-NFCOR   rlupat  R       0:03      1 papr-res-compute01


2.3.2. Things to note for Peter Mac Cluster

Best not to launch Nextflow on a login node

Even though Nextflow is not supposed to do any heavy computation itself, it still consumes CPU and memory to orchestrate a workflow. Our login node is not designed to handle multiple users each running a Groovy application that spawns further operations.

That said, our previous exercise showed that launching Nextflow from a compute node is not possible either. So what is the solution?

Our cluster prohibits compute nodes from spawning new jobs. There are only two partitions that are currently allowed to spawn new jobs: janis and janis-dev. Therefore, if you are submitting your Nextflow pipeline via an sbatch script, it is best to point that script at the janis partition.
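
A minimal launcher-script sketch; the resource requests, module name and job name below are assumptions to adapt to your own setup:

#!/bin/bash
#SBATCH --partition=janis
#SBATCH --job-name=nf-launcher
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH --time=24:00:00

# Load Nextflow (module name is an assumption for this environment)
module load nextflow

# The launcher job only orchestrates the run; each task is submitted to
# SLURM by the 'slurm' executor defined in the accompanying nextflow.config
nextflow run nf-core/rnaseq -r 3.11.1 \
    -params-file workshop-params.yaml \
    -profile singularity \
    -resume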

Set your working directory to a scratch space

When we launch a Nextflow workflow, by default it creates its work directory under the current directory, and all intermediate files are stored there until they are cleaned up at completion. This means that if you run a long-running workflow, chances are your intermediate files will be sent to long-term tape archiving. There are also benefits to running in scratch: it uses faster disks, resulting in faster execution.
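
You can point the work directory at scratch with the -w (work-dir) option; the path below is a placeholder for your own scratch allocation:

nextflow run nf-core/rnaseq -r 3.11.1 \
    -profile singularity \
    -w /scratch/users/$USER/nxf_work \
    -resume

Alternatively, set workDir = '/scratch/users/<your-username>/nxf_work' in your configuration file so every run uses it by default.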

2.3.3. Clean your work directory

Your work directory can get very big very quickly (especially if you are using full-sized datasets). It is good practice to clean your work directory regularly. Rather than removing the work folder with all of its contents, the Nextflow clean function allows you to selectively remove data associated with specific runs.

nextflow clean -help

The -after, -before, and -but options are all very useful to select specific runs to clean. The -dry-run option is also very useful to see which files will be removed if you were to -force the clean command.
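For example, to preview and then remove the work files of all runs before a given run (the run name nostalgic_boyd is a made-up placeholder; use nextflow log to list your own run names):

nextflow log                                       # list previous runs and their names
nextflow clean -before nostalgic_boyd -dry-run     # preview what would be deleted
nextflow clean -before nostalgic_boyd -force       # actually delete it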

Challenge

Use Nextflow to clean your work directory of staged files, but keep your execution logs.

Use the Nextflow clean command with the -k and -f options:

nextflow clean -k -f


2.3.4. Change default Nextflow cache strategy

Workflow execution is sometimes not resumed as expected. By default, Nextflow cache keys index the input files’ metadata. Reducing the cache stringency to lenient means the file cache keys are based only on file size and path, which can help avoid unexpectedly re-running certain processes when -resume is in use.

To apply lenient cache strategy to all of your runs, you could add to a custom configuration file:

process {
    cache = 'lenient'
}

You can specify different cache strategies for different processes by using withName or withLabel. You can also specify that a particular cache strategy be applied only to certain profiles within your institutional config, or apply it to all profiles described within that config by placing the above process code block outside the profiles scope.
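
A minimal sketch of the per-profile pattern; the profile name my_cluster and process name ALIGN_READS are placeholders:

profiles {
    my_cluster {
        process {
            cache = 'lenient'

            // Hypothetical process that should keep the stricter default cache
            withName: 'ALIGN_READS' {
                cache = true
            }
        }
    }
}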

2.3.5. Access private GitHub repositories

To interact with private repositories on GitHub, you can provide Nextflow with access to GitHub by specifying your GitHub user name and a Personal Access Token in the scm configuration file inside your .nextflow/ directory (by default $HOME/.nextflow/scm):

providers {

  github {
    user = 'rlupat'
    password = 'my-personal-access-token'
  }

}
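
Once the scm file is in place, you can pull or run the private pipeline as usual (the repository name below is a made-up placeholder):

nextflow pull rlupat/my-private-pipeline
nextflow run rlupat/my-private-pipeline -r main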

2.3.6. Nextflow Tower

BioCommons Tower Instance

2.3.7. Additional resources

Here are some useful resources to help you get started with running nf-core pipelines and developing Nextflow pipelines:


These materials are adapted from Customising Nf-Core Workshop by Sydney Informatics Hub