Best practice, tips and tricks
3.3.1. Running Nextflow Pipelines on an HPC
By default, Nextflow launches multiple parallel tasks that can run concurrently. Recall that previously we ran these tasks locally. We can, however, use the process and executor scopes to run these tasks through an HPC job scheduler such as SLURM, submitting the desired number of concurrent jobs.
process {
    executor = 'slurm'
    queue = 'PARTITION'
}

executor {
    queueSize = 4
}
By specifying the executor as slurm, Nextflow will submit each process as a separate job using the sbatch command. All jobs will be submitted to the PARTITION partition.
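For example, assuming the configuration above is saved as a file called custom.config, it can be applied when launching a pipeline with the -c option (the pipeline name and profile below are placeholders for whatever you are running):

# Launch a pipeline with the custom scheduler configuration
# (pipeline name and profile are placeholders for this example)
nextflow run nf-core/rnaseq -profile singularity -c custom.config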
Inside the process { } scope, we can also define resources such as cpus, time, memory, and queue.
process {
    executor = 'slurm'
    queue = 'PARTITION'
    cpus = 1
    time = '2h'
    memory = '4.GB'
}

executor {
    queueSize = 4
}
Now each individual job will be executed using 1 CPU, 4 GB of memory, and a maximum time limit of 2 hours. Since we didn’t specify a process label or a process name, these settings will apply to all processes within the pipeline.
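If you want to check exactly what Nextflow requests from SLURM for a given task, one approach (a sketch; the hashed work directory below is a placeholder taken from your own run) is to inspect the .command.run script that Nextflow writes into each task's work directory, which contains the generated scheduler directives:

# Show the SLURM directives generated for one task
# (replace the hashed path with a real task directory from your run)
grep 'SBATCH' work/ab/123456.../.command.run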
Run processes on different partitions
Previously, we used the withLabel and withName process selectors to specify the cpus, time, and memory for a group of processes, or a particular process. We can also use these process selectors to change which partition a job is submitted to.
For example, suppose we have one process that requires the use of GPUs. If we change the queue to our GPU partition gpu_partition, all process jobs, even those that don’t require a GPU, will be run on that partition.
process {
    executor = 'slurm'
    queue = 'gpu_partition'
    cpus = 1
    time = '2h'
    memory = '4.GB'
}

executor {
    queueSize = 4
}
Instead, we can use the withName process selector to send the job execution for that process to a GPU-specific partition. This means we won’t unnecessarily use GPU partition resources for the other processes.
process {
    executor = 'slurm'
    queue = 'PARTITION'
    cpus = 1
    time = '2h'
    memory = '4.GB'

    withName: 'GPU_PROCESS' {
        queue = 'gpu_queue'
    }
}

executor {
    queueSize = 4
}
Specify infrastructure-specific directives for your jobs
Adjusting the custom configuration file above, we can pass any native scheduler options using the clusterOptions process directive, which is used to specify settings that are not already exposed as Nextflow directives.
For example, if you are running your pipeline on an HPC system where usage is billed, you can specify which project the resource usage is charged to.
Suppose you typically submit a job using the following command, where --account is used to specify the project number that resource usage is billed to:

sbatch --account=PROJECT1 script.sh
By default, account is not a supported process directive in Nextflow, so we cannot use the following config:
process {
    executor = 'slurm'
    queue = 'PARTITION'
    account = 'PROJECT1'
    cpus = 1
    time = '2h'
    memory = '4.GB'

    withName: 'GPU_PROCESS' {
        queue = 'gpu_queue'
    }
}

executor {
    queueSize = 4
}
Instead, this can be specified using clusterOptions, as below:
process {
    executor = 'slurm'
    queue = 'PARTITION'
    clusterOptions = "--account=PROJECT1"
    cpus = 1
    time = '2h'
    memory = '4.GB'

    withName: 'GPU_PROCESS' {
        queue = 'gpu_queue'
    }
}

executor {
    queueSize = 4
}
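Because clusterOptions is an ordinary process directive, it can also be set inside a process selector so that scheduler-specific flags only apply to the processes that need them. The sketch below assumes the hypothetical GPU_PROCESS from above and a cluster where GPUs are requested with SLURM's --gres flag:

process {
    executor = 'slurm'
    queue = 'PARTITION'
    clusterOptions = "--account=PROJECT1"

    // Request one GPU only for the process that needs it.
    // Note: clusterOptions set here replaces the process-level value
    // for this process, so the account flag is repeated.
    withName: 'GPU_PROCESS' {
        queue = 'gpu_queue'
        clusterOptions = "--account=PROJECT1 --gres=gpu:1"
    }
}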
On certain HPC systems, you may not be able to submit new jobs from another job (such as an interactive session). In this case, you may get the following error:
sbatch: error: Batch job submission failed: Access/permission denied
To overcome this, run your workflow from a login node (after exiting your interactive session).
3.3.2. Clean your work directory
Your work directory can get very big, very quickly (especially if you are using full-sized datasets). It is good practice to clean your work directory regularly. Rather than removing the work folder with all of its contents, the Nextflow clean command allows you to selectively remove data associated with specific runs.
nextflow clean -help
Clean up project cache and work directories
Usage: clean [options]
  Options:
    -after
      Clean up runs executed after the specified one
    -before
      Clean up runs executed before the specified one
    -but
      Clean up all runs except the specified one
    -n, -dry-run
      Print names of file to be removed without deleting them
      Default: false
    -f, -force
      Force clean command
      Default: false
    -h, -help
      Print the command usage
      Default: false
    -k, -keep-logs
      Removes only temporary files but retains execution log entries and
      metadata
      Default: false
    -q, -quiet
      Do not print names of files removed
      Default: false
The -after, -before, and -but options are all very useful for selecting specific runs to clean. The -dry-run option is also very useful for seeing which files would be removed if you were to -force the clean command.
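For example, one cautious pattern (a sketch; golden_majorana is a placeholder run name taken from the output of nextflow log) is to preview the deletion with a dry run before forcing it:

# List previous runs and their names
nextflow log

# Preview which files would be removed for runs executed before the named run
nextflow clean -before golden_majorana -n

# Remove them once the preview looks right
nextflow clean -before golden_majorana -f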
Exercise: Use Nextflow to clean your work directory of all staged files, but keep your execution logs.
Use the Nextflow clean command with the -k and -f options:
nextflow clean -k -f
3.3.3. Change the default Nextflow cache strategy
Sometimes, a workflow execution is not resumed as expected. By default, Nextflow cache keys index the metadata of the input files. Relaxing the cache stringency to lenient means the file cache keys are based only on file size and path, which can help avoid unexpectedly re-running certain processes when -resume is in use.
To apply the lenient cache strategy to all of your runs, you could add the following to a custom configuration file:
process {
    cache = 'lenient'
}
Again, you can specify different cache strategies for different processes by using withName or withLabel.
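For instance, a minimal sketch that keeps the default cache behaviour globally but relaxes it for a single process (the process name DEDUPLICATE is hypothetical):

process {
    // Only this process uses lenient cache keys;
    // the process name is a placeholder for this example.
    withName: 'DEDUPLICATE' {
        cache = 'lenient'
    }
}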
3.3.4. Access private GitHub repositories
To interact with private repositories on GitHub, you can provide Nextflow with access to GitHub by specifying your GitHub username and a Personal Access Token in the scm configuration file inside your .nextflow/ directory:
providers {
    github {
        user = 'rlupat'
        password = 'my-personal-access-token'
    }
}
Replace 'my-personal-access-token' with your personal access token.
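With the scm file in place, a private pipeline can then be pulled and run by its repository name just like a public one. A sketch, assuming a hypothetical private repository my-org/private-pipeline:

# Pull (or update) the private pipeline using the credentials in ~/.nextflow/scm
nextflow pull my-org/private-pipeline

# Run a specific revision of the private pipeline
nextflow run my-org/private-pipeline -r main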
3.3.5. Additional resources
Here are some useful resources to help you get started with running and developing nf-core pipelines: