Nextflow Development - Metadata Parsing

**Metadata Parsing**

Currently, we have defined the reads parameter as a string:

params.reads = "/.../training/nf-training/data/ggal/gut_{1,2}.fq"

To group the read files into pairs, the fromFilePairs channel factory can be used. Add the following to the workflow block and run the workflow:

reads_ch = Channel.fromFilePairs("$params.reads")
reads_ch.view()

The reads parameter is converted into a file pair group by fromFilePairs and assigned to reads_ch. The reads_ch channel emits a tuple of two items: the first is the grouping key of the matching pair (gut), and the second is a list of paths to each file:

[gut, [/.../training/nf-training/data/ggal/gut_1.fq, /.../training/nf-training/data/ggal/gut_2.fq]]
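A tuple channel with this shape is typically consumed by a process that declares a tuple val(...), path(...) input. The following is only a minimal sketch; the EXAMPLE_QC process name and its echo command are hypothetical and not part of rnaseq.nf:

// Hypothetical process (not part of rnaseq.nf) illustrating how a
// [sample_id, [read1, read2]] tuple from fromFilePairs maps onto a
// tuple val(...), path(...) input declaration.
process EXAMPLE_QC {
    input:
    tuple val(sample_id), path(reads)

    script:
    """
    echo "sample: ${sample_id}  files: ${reads}"
    """
}

Calling EXAMPLE_QC(reads_ch) inside the workflow block would run the process once per pair, with sample_id set to gut and reads holding the two staged .fq files.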

Glob patterns can also be used to create channels of file pair groups. Inside the data directory, we have pairs of gut, liver, and lung files that can all be read into reads_ch.

>>> ls "/.../training/nf-training/data/ggal/"

gut_1.fq  gut_2.fq  liver_1.fq  liver_2.fq  lung_1.fq  lung_2.fq  transcriptome.fa

Run the rnaseq.nf workflow specifying all .fq files inside /.../training/nf-training/data/ggal/ as the reads parameter via the command line:

nextflow run rnaseq.nf --reads '/.../training/nf-training/data/ggal/*_{1,2}.fq'

File paths that include one or more wildcards (i.e. *, ?, etc.) MUST be wrapped in single quotes to prevent Bash from expanding the glob on the command line.

The reads_ch now contains three tuple elements with unique grouping keys:

[gut, [/.../training/nf-training/data/ggal/gut_1.fq, /.../training/nf-training/data/ggal/gut_2.fq]]
[liver, [/.../training/nf-training/data/ggal/liver_1.fq, /.../training/nf-training/data/ggal/liver_2.fq]]
[lung, [/.../training/nf-training/data/ggal/lung_1.fq, /.../training/nf-training/data/ggal/lung_2.fq]]

The grouping key metadata can also be created explicitly, without relying on file names, using the map channel operator. Let’s start by creating a samplesheet rnaseq_samplesheet.csv with the column headings sample_name, fastq1, and fastq2, and filling in a custom sample_name for each pair along with the paths to the .fq files.

sample_name,fastq1,fastq2
gut_sample,/.../training/nf-training/data/ggal/gut_1.fq,/.../training/nf-training/data/ggal/gut_2.fq
liver_sample,/.../training/nf-training/data/ggal/liver_1.fq,/.../training/nf-training/data/ggal/liver_2.fq
lung_sample,/.../training/nf-training/data/ggal/lung_1.fq,/.../training/nf-training/data/ggal/lung_2.fq

Let’s now supply the path to rnaseq_samplesheet.csv to the reads parameter in rnaseq.nf.

params.reads = "/.../rnaseq_samplesheet.csv"

Previously, the reads parameter was a glob string matching the .fq files directly. Now, it is the path to a .csv file that lists the .fq files. Therefore, the channel factory method that reads the input also needs to be changed. Since the parameter is now a single file path, the fromPath method can be used first, which creates a channel of Path objects. The splitCsv channel operator can then be used to parse the contents of the file.

reads_ch = Channel.fromPath(params.reads)
reads_ch.view()

reads_ch = reads_ch.splitCsv(header:true)
reads_ch.view()

When using splitCsv in the above example, header is set to true. This will use the first line of the .csv file as the column names. Let’s run the pipeline containing the new input parameter.

>>> nextflow run rnaseq.nf

N E X T F L O W  ~  version 23.04.1
Launching `rnaseq.nf` [distraught_avogadro] DSL2 - revision: 525e081ba2
reads: rnaseq_samplesheet.csv
reads: $params.reads
executor >  local (1)
[4e/eeae2a] process > INDEX [100%] 1 of 1 ✔
/.../rnaseq_samplesheet.csv
[sample_name:gut_sample, fastq1:/.../training/nf-training/data/ggal/gut_1.fq, fastq2:/.../training/nf-training/data/ggal/gut_2.fq]
[sample_name:liver_sample, fastq1:/.../training/nf-training/data/ggal/liver_1.fq, fastq2:/.../training/nf-training/data/ggal/liver_2.fq]
[sample_name:lung_sample, fastq1:/.../training/nf-training/data/ggal/lung_1.fq, fastq2:/.../training/nf-training/data/ggal/lung_2.fq]

The /.../rnaseq_samplesheet.csv line is the output of reads_ch directly after the fromPath channel factory was used; at this point the channel holds a single Path object. After invoking the splitCsv channel operator, reads_ch is replaced with a channel of three elements, one per row of the .csv file. Since header was set to true, each row is returned as a map in which the values are keyed by the column names. This can be used when creating the custom grouping key.
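Because each row is emitted as a map keyed by the column names, individual columns can be read by name. As a small sketch (not part of rnaseq.nf), the sample_name and fastq1 columns could be printed directly:

// Sketch only: each row from splitCsv(header: true) behaves like a map,
// so columns can be read with dot notation or by key.
Channel.fromPath(params.reads)
    .splitCsv(header: true)
    .view { row -> "sample=${row.sample_name} fastq1=${row['fastq1']}" }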

To create grouping key metadata from the list output by splitCsv, the map channel operator can be used.

reads_ch = reads_ch.map { row ->
    def grp_meta = "$row.sample_name"
    [grp_meta, [row.fastq1, row.fastq2]]
}
reads_ch.view()

Here, each element emitted by reads_ch is passed to the closure as the variable row. We then create the custom grouping key metadata grp_meta from the sample_name column of the .csv, accessed from the row variable using dot notation. After the custom metadata key is assigned, a tuple is created with grp_meta as the first element and a list of the two .fq files (row.fastq1 and row.fastq2) as the second element.

Let’s run the pipeline containing the custom grouping key:

>>> nextflow run rnaseq.nf

N E X T F L O W  ~  version 23.04.1
Launching `rnaseq.nf` [happy_torricelli] DSL2 - revision: e9e1499a97
reads: rnaseq_samplesheet.csv
reads: $params.reads
[-        ] process > INDEX -
[gut_sample, [/.../training/nf-training/data/ggal/gut_1.fq, /.../training/nf-training/data/ggal/gut_2.fq]]
[liver_sample, [/.../training/nf-training/data/ggal/liver_1.fq, /.../training/nf-training/data/ggal/liver_2.fq]]
[lung_sample, [/.../training/nf-training/data/ggal/lung_1.fq, /.../training/nf-training/data/ggal/lung_2.fq]]
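One optional refinement, shown here only as a sketch and not required by the training material: fromFilePairs emitted Path objects, whereas the CSV columns are plain strings, so the same map step could wrap the strings with file() if Path objects are preferred downstream:

// Hedged variant of the map shown above: file() turns the CSV path
// strings into Path objects, matching the shape produced by fromFilePairs.
reads_ch = Channel.fromPath(params.reads)
    .splitCsv(header: true)
    .map { row ->
        def grp_meta = "$row.sample_name"
        [grp_meta, [file(row.fastq1), file(row.fastq2)]]
    }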

The custom grouping key can be created from multiple values in the samplesheet. For example, grp_meta = [sample: row.sample_name, file: row.fastq1] will create the metadata key using both the sample_name and the fastq1 file name. The samplesheet can also be extended to include additional sample characteristics, such as lane, data_type, etc. Each of these characteristics can be used to ensure an adequate grouping key is created for that sample.
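Placed in the context of the full map operation, that composite key would look like this (a sketch; the rest of the chain is unchanged from above):

// Sketch of the composite grouping key mentioned above, built from the
// existing sample_name and fastq1 columns of rnaseq_samplesheet.csv.
reads_ch = Channel.fromPath(params.reads)
    .splitCsv(header: true)
    .map { row ->
        def grp_meta = [sample: row.sample_name, file: row.fastq1]
        [grp_meta, [row.fastq1, row.fastq2]]
    }
reads_ch.view()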