Best practice, tips and tricks
2.3.1. Running Nextflow Pipelines on an HPC
Nextflow, by default, spawns parallel task executions wherever it is running. You can use Nextflow's executors feature to run these tasks using an HPC job scheduler such as SLURM or PBS Pro. Use a custom configuration file to send all processes to the job scheduler as separate jobs and define essential resource requests like cpus, time, memory, and queue inside a process {} scope.
Run all workflow tasks as separate jobs on HPC
In this custom configuration file we have sent all tasks that a workflow is running to a SLURM job scheduler and specified jobs to be run on the prod_short queue, each running for a max time of 2 hours with 1 CPU and 4 GB of memory:
process {
    executor = 'slurm'
    queue = 'prod_short'
    cpus = 1
    time = '2h'
    memory = '4.GB'
}
Run processes with different resource profiles as HPC jobs
Adjusting the custom configuration file above, we can use the withName {} process selector to specify process-specific resource requirements:
process {
    executor = 'slurm'
    withName: processONE {
        queue = 'prod_short'
        cpus = 1
        time = '2h'
        memory = '4.GB'
    }
    withName: processTWO {
        queue = 'prod_med'
        cpus = 2
        time = '10h'
        memory = '50.GB'
    }
}
Specify infrastructure-specific directives for your jobs
Adjusting the custom configuration file above, we can define any native scheduler options using the clusterOptions directive. We can use this to specify non-standard resources, for example the HPC project code to bill for all process jobs.
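As a minimal sketch, clusterOptions could be added to the process scope like this (the account value below is a placeholder, not a real project code; adjust the flag to whatever your scheduler expects):

process {
    executor = 'slurm'
    // billing account is a placeholder; replace with your own project code
    clusterOptions = '--account=my_project_code'
}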
You can also set up a config tailored to Peter Mac's HPC partition setup:
executor {
    queueSize = 100
    queueStatInterval = '1 min'
    pollInterval = '1 min'
    submitRateLimit = '20 min'
}

process {
    executor = 'slurm'
    cache = 'lenient'
    beforeScript = 'module load singularity'
    stageInMode = 'symlink'
    queue = { task.time < 2.h ? 'prod_short' : task.time < 24.h ? 'prod_med' : 'prod' }
}
Run the previous nf-core/rnaseq workflow using the process and executor scopes above (in a config file), and send each task to SLURM.
Create a nextflow.config file
process.executor = 'slurm'
Run the nf-core/rnaseq workflow again
nextflow run nf-core/rnaseq -r 3.11.1 \
    -params-file workshop-params.yaml \
    -profile singularity \
    --max_memory '6.GB' \
    --max_cpus 2 \
    -resume
Did you get the following error?
sbatch: error: Batch job submission failed: Access/permission denied
Try running the same workflow on a login node and observe the difference:
>>> squeue -u rlupat -i 5
17429286 prod nf-NFCOR rlupat R 0:03 1 papr-res-compute01
2.3.2. Things to note for Peter Mac Cluster
Best not to launch Nextflow on a login node
Even though Nextflow is not supposed to be doing any heavy computation, it still consumes CPUs and memory to perform some of its operations. Our login node is not designed to handle multiple users running a Groovy application that spawns further operations.
That said, as we saw in the previous exercise, launching Nextflow from a compute node is also not possible. So what is the solution?
Our cluster prohibits compute nodes from spawning new jobs. There are only two partitions that are currently allowed to spawn new jobs: janis and janis-dev. Therefore, if you are submitting your Nextflow pipeline in an sbatch file, it is probably good to point that script to the janis partition.
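A minimal launch script sketch, assuming Nextflow and Singularity are available as environment modules (the module names, resource values, and job name are assumptions, not site defaults):

#!/bin/bash
#SBATCH --job-name=nf-rnaseq-launch
#SBATCH --partition=janis          # partition that is allowed to submit new jobs
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH --time=24:00:00

# load tools for the head job (module names are assumptions; adjust to your environment)
module load nextflow
module load singularity

# the head job stays lightweight; all tasks are submitted to SLURM via your config
nextflow run nf-core/rnaseq -r 3.11.1 \
    -params-file workshop-params.yaml \
    -profile singularity \
    -resume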
Set your working directory to a scratch space
When we launch a Nextflow workflow, by default it will use the current directory to create a work directory, and all the intermediate files will be stored there until they are cleaned up after completion. This means that if you run a long-running workflow, chances are your intermediate files will be sent to tape long-term archiving. There are also performance benefits to running in scratch, as it uses faster disks, resulting in faster execution.
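To point the work directory at scratch, you can pass -work-dir (or -w) on the command line, or set workDir in your config. The scratch path below is a placeholder for your own scratch allocation:

nextflow run nf-core/rnaseq -r 3.11.1 -profile singularity -w /scratch/users/rlupat/nf-work -resume

// or in nextflow.config (path is a placeholder)
workDir = '/scratch/users/rlupat/nf-work'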
2.3.3. Clean your work directory
Your work directory can get very big very quickly (especially if you are using full-sized datasets). It is good practice to clean your work directory regularly. Rather than removing the work folder with all of its contents, the Nextflow clean function allows you to selectively remove data associated with specific runs.
nextflow clean -help
The -after, -before, and -but options are all very useful to select specific runs to clean. The -dry-run option is also very useful to see which files will be removed if you were to -force the clean command.
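For example, a possible clean-up workflow looks like this (run names come from your own nextflow log output; the run name below is illustrative only):

# list previous runs and their names
nextflow log

# preview which files would be removed for runs before a given run
nextflow clean -before gloomy_williams -dry-run

# actually remove them
nextflow clean -before gloomy_williams -force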
Use Nextflow to clean your work directory of staged files but keep your execution logs.
Use the Nextflow clean command with the -k and -f options:
nextflow clean -k -f
2.3.4. Change default Nextflow cache strategy
Workflow execution is sometimes not resumed as expected. The default behaviour of Nextflow cache keys is to index the input files' metadata. Reducing the cache stringency to lenient means the file cache keys are based only on file size and path, which can help to avoid unexpectedly re-running certain processes when -resume is in use.
To apply lenient cache strategy to all of your runs, you could add to a custom configuration file:
process {
    cache = 'lenient'
}
You can specify different cache strategies for different processes by using withName or withLabel. You can specify a particular cache strategy to be applied to certain profiles within your institutional config, or apply it to all profiles described within that config by placing the above process code block outside the profiles scope.
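As a sketch, the per-process selector syntax looks like this (the process name is used for illustration only):

process {
    cache = 'lenient'
    // keep the default cache behaviour for this particular process only
    withName: 'STAR_ALIGN' {
        cache = true
    }
}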
2.3.5. Access private GitHub repositories
To interact with private repositories on GitHub, you can provide Nextflow with access to GitHub by specifying your GitHub user name and a Personal Access Token in the scm configuration file inside your .nextflow/ directory:
providers {
    github {
        user = 'rlupat'
        password = 'my-personal-access-token'
    }
}
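With the scm file in place, private repositories can be pulled and run the same way as public ones (the repository name below is hypothetical):

nextflow pull my-org/my-private-pipeline
nextflow run my-org/my-private-pipeline -r main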
2.3.6. Nextflow Tower
2.3.7. Additional resources
Here are some useful resources to help you get started with running nf-core pipelines and developing Nextflow pipelines:
- Nextflow tutorials
- nf-core pipeline tutorials
- Nextflow patterns
- HPC tips and tricks
- Nextflow coding best practice recommendations
- The Nextflow blog
These materials are adapted from the Customising Nf-Core Workshop by Sydney Informatics Hub.