Main inputs
JSON
As explained by the Workflow Description Language (WDL) authors, Cromwell uses a JSON-formatted file as its main input. In MIGNON, this file contains absolute paths to the files used during the analysis, decision variables that control which workflow steps are carried out (e.g. whether or not to perform the variant calling) and arguments that control each task execution (e.g. the number of threads to use for the alignment).
Below is an example of the content of a MIGNON input JSON file. To build your own input file, please go to the Preparing the JSON section.
{
  "MIGNON.execution_mode": "hisat2",
  "MIGNON.is_paired_end": true,
  "MIGNON.do_vc": true,
  "MIGNON.input_fastq_r1": [
    "flagForPwd/mignon_test_data/subset_fastq/SRR8615222_1.fastq.gz",
    "flagForPwd/mignon_test_data/subset_fastq/SRR8615223_1.fastq.gz",
    "flagForPwd/mignon_test_data/subset_fastq/SRR8615224_1.fastq.gz",
    "flagForPwd/mignon_test_data/subset_fastq/SRR8615225_1.fastq.gz"
  ],
  "MIGNON.input_fastq_r2": [
    "flagForPwd/mignon_test_data/subset_fastq/SRR8615222_2.fastq.gz",
    "flagForPwd/mignon_test_data/subset_fastq/SRR8615223_2.fastq.gz",
    "flagForPwd/mignon_test_data/subset_fastq/SRR8615224_2.fastq.gz",
    "flagForPwd/mignon_test_data/subset_fastq/SRR8615225_2.fastq.gz"
  ],
  "MIGNON.sample_id": [
    "Control_1",
    "Control_2",
    "Problem_1",
    "Problem_2"
  ],
  "MIGNON.group": [
    "Control",
    "Control",
    "Problem",
    "Problem"
  ],
  "MIGNON.gtf_file": "flagForPwd/mignon_test_data/Homo_sapiens.GRCh38.99.chr.gtf",
  "MIGNON.hisat2_index_path": "flagForPwd/mignon_test_data/hisat_index",
  "MIGNON.hisat2_index_prefix": "genome",
  "MIGNON.vep_cache_dir": "flagForPwd/mignon_test_data/subset_vep_cache",
  "MIGNON.ref_fasta": "flagForPwd/mignon_test_data/subset_genome.fa",
  "MIGNON.ref_fasta_index": "flagForPwd/mignon_test_data/subset_genome.fa.fai",
  "MIGNON.ref_dict": "flagForPwd/mignon_test_data/subset_genome.dict",
  "MIGNON.db_snp_vcf": "flagForPwd/mignon_test_data/subset_variants.vcf.gz",
  "MIGNON.db_snp_vcf_index": "flagForPwd/mignon_test_data/subset_variants.vcf.gz.tbi",
  "MIGNON.known_vcfs": ["flagForPwd/mignon_test_data/subset_variants.vcf.gz"],
  "MIGNON.known_vcfs_indices": ["flagForPwd/mignon_test_data/subset_variants.vcf.gz.tbi"],
  "MIGNON.edger_script": "flagForPwd/scripts/edgeR.r",
  "MIGNON.hipathia_script": "flagForPwd/scripts/hipathia.r"
}
Cromwell configuration file
The configuration file tells Cromwell how to handle and launch the jobs that make up the workflow. When executing a single job, it places the values from the WDL file in the appropriate position within the container engine command. For example, the `requested_memory` runtime input, which controls the memory allocated for the execution of a task, is passed to the `--memory` argument when using `docker run`. On the other hand, when executing the workflow in an HPC environment, the `requested_memory` runtime input is used when submitting the job with `sbatch --mem` (Slurm command). In addition, this file controls other top-level features of Cromwell, such as how duplicated files are managed or the number of parallel jobs that can be executed. Users can find more information about the backends that can be configured through this file in the Cromwell documentation.
To run MIGNON, we have prepared two different configuration files. The first one, `LocalWithDocker.conf`, is used to run MIGNON locally, using Docker to execute the containerized software. The second one, `SlurmAndSingularity.conf`, is intended for HPC environments that use Slurm as workload manager and have Singularity installed as a module to run the containerized software.
Execution modes
MIGNON has five different execution modes, each of which uses a different combination of tools in the initial steps of the workflow. For example, the “salmon-hisat2” execution mode uses HISAT2 to perform the alignment and Salmon to quantify gene expression from the reads. The HISAT2 alignments can then be used to run the variant calling (VC) sub-workflow with GATK to obtain the genomic information. Each mode has a specific computational profile and should be selected depending on the user's needs. In a nutshell, in modes where STAR is used, both the memory requirements and the speed of the alignment are high, whereas in modes where HISAT2 is employed, the memory requirements and alignment speed are lower, allowing a higher number of parallel jobs for the same computational price. The following table summarizes the basic properties of each execution mode:
Execution mode | Alignment | Quantification | Allows VC |
---|---|---|---|
“salmon-hisat2” | HISAT2 | Salmon | Yes |
“salmon-star” | STAR | Salmon | Yes |
“hisat2” | HISAT2 | featureCounts | Yes |
“star” | STAR | featureCounts | Yes |
“salmon” | - | Salmon | No |
In addition, the following figure shows which steps of the workflow are carried out under each execution mode:
Depending on the execution mode, the required values in the input JSON can change. We strongly recommend using the combined execution modes “salmon-star” or “salmon-hisat2”, as they use the pseudo-alignment strategy to quantify gene expression, which deals with the multi-mapping read problem, and STAR or HISAT2 to obtain the alignments for the variant calling sub-workflow.
Preparing the JSON
The JSON file is used to store and register the file paths, workflow variables and execution parameters. For MIGNON, the values that should be included in this file can be divided into two categories: required values and default values.
Required values
The values detailed in this section are mandatory and vary depending on the execution mode. Apart from the input reads and sample IDs, it is important to pay attention to the reference material required to perform the alignment, pseudo-alignment or variant calling. The following table contains the basic information about each value and the execution modes where it is required; a minimal example for the “salmon” mode is shown after the table. The “preferred source” column details the source of the reference material that was used to run the workflow.
Input | Required at | Variable type | File format | Description | Preferred source |
---|---|---|---|---|---|
is_paired_end | All | Boolean | - | Are input reads paired-end? | - |
input_fastq_r1 | All | Array[File] | fastq | Array of paths indicating the location of the fastq files to be processed. If paired-end, the path to the (_1) files. | - |
input_fastq_r2 | All | Array[File] | fastq | If paired-end, the path to the (_2) files. The position of each element in the array should match its pair in the input_fastq_r1 variable. | - |
sample_id | All | Array[String] | - | Array of sample identifiers. These identifiers are used across the different tasks of the pipeline to identify each input sample. The position of each element in the array should match its pair in the input_fastq_r1 variable. | - |
group | All | Array[String] | - | Array of strings indicating the group to which each sample belongs. These groups are used to perform the differential expression and signaling analyses. The position of each element in the array should match its pair in the input_fastq_r1 variable. | - |
execution_mode | - | String | - | String indicating the execution mode to be used. | - |
do_vc | “salmon-star”, “salmon-hisat2”, “hisat2”, “star” | Boolean | - | Whether to perform the variant calling. Set it to false only if users do not want to extract and use variants from the RNA-Seq data. | - |
gtf_file | All | File | gtf | Annotation file for the genome, used during alignment and variant calling. It is also the input used to create the transcript-to-gene file necessary for the Salmon quantifications. | ENSEMBL |
ref_fasta | “salmon-star”, “salmon-hisat2”, “hisat2”, “star” | File | fasta | Reference genome used to perform the variant calling. | ENSEMBL |
ref_fasta_index | “salmon-star”, “salmon-hisat2”, “hisat2”, “star” | File | fai | Reference genome index used to perform the variant calling. | Created from ref_fasta with samtools |
ref_dict | “salmon-star”, “salmon-hisat2”, “hisat2”, “star” | File | dict | Reference genome dictionary used to perform the variant calling. | Created from ref_fasta with samtools |
db_snp_vcf | “salmon-star”, “salmon-hisat2”, “hisat2”, “star” | File | vcf | Database of SNPs used to perform the variant calling. | NCBI |
db_snp_vcf_index | “salmon-star”, “salmon-hisat2”, “hisat2”, “star” | File | tbi | SNP database index. | NCBI |
known_vcfs | “salmon-star”, “salmon-hisat2”, “hisat2”, “star” | Array[File] | vcf | Databases of INDELs used to perform the variant calling. | NCBI |
known_vcfs_indices | “salmon-star”, “salmon-hisat2”, “hisat2”, “star” | Array[File] | tbi | INDEL database indices. | NCBI |
vep_cache_dir | “salmon-star”, “salmon-hisat2”, “hisat2”, “star” | String | - | Path to the VEP cache directory containing the variant annotations to be used. It should contain the SIFT and PolyPhen scores. | ENSEMBL |
hisat2_index_path | “salmon-hisat2”, “hisat2” | String | - | Path to the directory where the HISAT2 index is stored. | Created from ref_fasta with HISAT2 |
hisat2_index_prefix | “salmon-hisat2”, “hisat2” | String | - | HISAT2 index prefix. | - |
star_index_path | “salmon-star”, “star” | String | - | Path to the directory where the STAR index is stored. | Created from ref_fasta with STAR |
salmon_index_path | “salmon”, “salmon-star”, “salmon-hisat2” | String | - | Path to the directory where the Salmon index is stored. | Created from ref_fasta with Salmon |
edger_script | All | File | - | Script executed in the EdgeR task. | Included in MIGNON |
tximport_script | “salmon”, “salmon-star”, “salmon-hisat2” | File | - | Script executed in the TxImport task. | Included in MIGNON |
hipathia_script | All | File | - | Script executed in the hiPathia task. | Included in MIGNON |
ensemblTx_script | “salmon”, “salmon-star”, “salmon-hisat2” | File | - | Script executed in the ensembldb task. It transforms the provided GTF into a Tx2Gene file used by tximport. | Included in MIGNON |
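As a reference, the pseudo-alignment-only “salmon” mode does not use the variant calling inputs, so a minimal input JSON for it would only contain the values marked as required for that mode in the table above. The sketch below simply follows that table; all paths, as well as the tximport and ensemblTx script file names, are placeholders rather than files shipped with the test data:
{
  "MIGNON.execution_mode": "salmon",
  "MIGNON.is_paired_end": true,
  "MIGNON.input_fastq_r1": ["/path/to/Control_1_1.fastq.gz", "/path/to/Problem_1_1.fastq.gz"],
  "MIGNON.input_fastq_r2": ["/path/to/Control_1_2.fastq.gz", "/path/to/Problem_1_2.fastq.gz"],
  "MIGNON.sample_id": ["Control_1", "Problem_1"],
  "MIGNON.group": ["Control", "Problem"],
  "MIGNON.gtf_file": "/path/to/annotation.gtf",
  "MIGNON.salmon_index_path": "/path/to/salmon_index",
  "MIGNON.edger_script": "/path/to/scripts/edgeR.r",
  "MIGNON.tximport_script": "/path/to/scripts/tximport.r",
  "MIGNON.hipathia_script": "/path/to/scripts/hipathia.r",
  "MIGNON.ensemblTx_script": "/path/to/scripts/ensemblTx.r"
}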
Default values
Some inputs have a default value that can optionally be modified by users. This section covers different types of inputs, from those that modify the threads and memory used by the tools to default values used in the multi-omic integrative step.
Execution values
Some of the default values control the computational resources that each tool uses to carry out a particular task within the workflow. This group contains mainly three types of values: `cpu`, `mem` and `additional_parameters`. The `cpu` value is placed in the proper position of each command in the workflow to modify, when possible, the number of threads used by the tool that performs the task. The `mem` value indicates the amount of memory allocated to the execution of the task. Finally, `additional_parameters` allows users to pass additional arguments to some of the tools within the workflow. All the execution inputs are defined with the same structure:
nameOfTheTask_value
And can be found in the following list:
Int? fastp_cpu = 1
String? fastp_mem = "16G"
String? fastp_additional_parameters = ""
Int? fastqc_cpu = 1
String? fastqc_mem = "16G"
String? fastqc_additional_parameters = ""
Int? hisat2_cpu = 1
String? hisat2_mem = "16G"
String? hisat2_additional_parameters = ""
Int? sam2bam_cpu = 1
String? sam2bam_mem = "16G"
String? sam2bam_additional_parameters = ""
Int? star_cpu = 1
String? star_mem = "32G"
String? star_additional_parameters = ""
Int? salmon_cpu = 1
String? salmon_mem = "16G"
String? salmon_additional_parameters = ""
Int? featureCounts_cpu = 1
String? featureCounts_mem = "16G"
String? featureCounts_additional_parameters = ""
Int? vep_cpu = 1
String? vep_mem = "16G"
Int? ensemblTx_cpu = 1
String? ensemblTx_mem = "16G"
Int? edger_cpu = 1
String? edger_mem = "16G"
Int? tximport_cpu = 1
String? tximport_mem = "16G"
Int? hipathia_cpu = 1
String? hipathia_mem = "16G"
Int? filterBam_cpu = 1
String? filterBam_mem = "16G"
String? filterBam_additional_parameters = ""
For example, to modify the execution of STAR, users can set the number of threads and the memory, and include additional arguments, by adding the following lines to the input JSON:
"MIGNON.star_cpu": 4,
"MIGNON.star_mem": "64G",
"MIGNON.star_additional_parameters": "--limitBAMsortRAM 60359738368 --runRNGseed 333"
In addition to these, the `haplotype_scatter_count` value deserves a special mention. As explained in the advanced section, this parameter allows users to apply the scatter-and-gather strategy to parallelize the GATK HaplotypeCaller sub-task. It determines the number of chunks into which the reference genome is divided to perform the variant calling from the aligned reads.
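For instance, splitting the variant calling into 10 chunks (an arbitrary value, chosen here only for illustration) would be requested by adding the following line to the input JSON:
"MIGNON.haplotype_scatter_count": 10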
Other values
Some default values are used to filter annotated variants and to perform the in silico knockdown of the genes affected by them. Other defaults are required for the execution of the tasks but do not have a direct impact on the output. All of them are listed below.
# variant filtration and knockdown values
Boolean? filter_unmapped = true
String? salmon_library_type = "A"
Int? edger_min_counts = 15
Boolean? hipathia_normalize = true
Float? hipathia_ko_factor = 0.01
Float vep_sift_cutoff = 0.05
Float vep_polyphen_cutoff = 0.95
# other values
File? tx2gene_file
String? rg_platform = "Unknown"
String? rg_center = "Unknown"
Int? min_confidence_for_variant_calling
File? ref_gz_index
File? intervals
File? input_bai
Those that actually have an impact on the output of the pipeline are detailed in the following table:
Input | Variable type | Default value | Description |
---|---|---|---|
filter_unmapped | Boolean | TRUE | Whether or not to filter unmapped reads from the alignment files. |
salmon_library_type | String | A | Library type for salmon quantification. Defaults to “A” (automatic detection). |
edger_min_counts | Int | 15 | Minimum counts per gene. This parameter is used to filter the count matrix before the differential expression analysis. By default, all genes with fewer than 15 counts across all samples are filtered out. |
hipathia_normalize | Boolean | TRUE | Normalize circuit pathway activity by length? HiPathia applies a signal propagation algorithm across the different receptor-effector circuits in the pathways to calculate the signaling circuit activity. The topology and diameter of such circuits influence the resulting circuit activity values, so this normalization is recommended. |
hipathia_ko_factor | Float | 0.01 | LoF variant knockdown factor. This parameter controls the factor by which the normalized expression values are multiplied when a loss-of-function (LoF) variant is detected for a gene/sample. |
vep_sift_cutoff | Float | 0.05 | SIFT cutoff value. In combination with vep_polyphen_cutoff, it is used to filter the variants that will be considered deleterious (LoF). |
vep_polyphen_cutoff | Float | 0.95 | PolyPhen cutoff value. In combination with vep_sift_cutoff, it is used to filter the variants that will be considered deleterious (LoF). |
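As an illustrative sketch, these defaults can be overridden from the input JSON like any other value; the numbers below are arbitrary examples, not recommendations:
"MIGNON.edger_min_counts": 30,
"MIGNON.hipathia_ko_factor": 0.001,
"MIGNON.vep_sift_cutoff": 0.01,
"MIGNON.vep_polyphen_cutoff": 0.99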
For more information about the integrative strategy applied by MIGNON, please read the article that accompanies this software.