What are the different methods for naming snakemake pipeline output files that depend on multiple variables?

I wrote a snakemake pipeline which is intended to be run again with different variables provided by the user in a new config file during each run.

config.yml:

param_a: 100 #filter dataset rule1
param_b: 200 #filter sample rule2
param_c: 300 #filter sample again rule3

config2.yml:

param_a: 150 #100->150
param_b: 200
param_c: 300

Snakefile:

rule rule1:
    # dataset is filtered by param_a
    output: "{dataset}_{param_a}/{sample}"

rule rule2:
    # sample is filtered by param_b
    output: "{dataset}_{param_a}/{sample}_{param_b}"

rule rule3:
    # sample is then filtered by param_c
    output: "{dataset}_{param_a}/{sample}_{param_b}_{param_c}"

The aim is to let the user rerun the analyses with different options at different steps, without having to rerun every step that comes before the one whose parameter changed.

When there are many such parameters, the directory and file names get too long, e.g.:

dataset1/sample-minSize200_samtools-F4-F1024-q20_mosdepth-minDepth4-maxDepth100_bedtools-merge-gap200_angsd-minQ20_loci-maxBase100/mysample.bam
dataset1/sample-minSize200_samtools-F4-F1024-q20_mosdepth-minDepth4-maxDepth100_bedtools-merge-gap200_angsd-minQ20_loci-maxBase200/mysample.bam

Is there a method for easier and more efficient naming, such as automatically creating version names and saving the parameter details to a text file?

I read about the shadow directory feature but I don't think it does what I am looking for.



Solution 1:[1]

If you want to be very fancy, you could encode the params into a SHA hash or similar and use that for the filename, recording each hash and its parameter values in a table. You just need a function that takes keyword params and translates them into the hash, and then use it for all your rule inputs. If I were you, though, I would use directories instead of flat filenames:

dataset1/sample-minSize200/samtools-F4-F1024-q20/mosdepth-minDepth4-maxDepth100/bedtools-merge-gap200/angsd-minQ20/loci-maxBase100/mysample.bam

That makes it easier to discard everything belonging to a parameter set you no longer need, and it makes directory listings faster.
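The hashing idea mentioned at the start of this answer could look roughly like the sketch below. It is only an illustration under assumptions: the names `param_hash`, `record_params`, and `HASH_LENGTH`, the choice of SHA-256, the 10-character truncation, and the tab-separated table layout are all made up here and are not part of Snakemake.

```python
import hashlib
import json
from pathlib import Path

# Assumption: 10 hex characters are enough to tell parameter sets apart.
HASH_LENGTH = 10


def param_hash(**params):
    """Return a short, deterministic hex digest for keyword parameters."""
    # sort_keys makes the digest independent of the keyword order.
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return digest[:HASH_LENGTH]


def record_params(table_path, **params):
    """Append 'hash<TAB>params-as-JSON' to a table once; return the hash."""
    digest = param_hash(**params)
    table = Path(table_path)
    known = set()
    if table.exists():
        with table.open() as handle:
            # The first tab-separated column holds the hash.
            known = {line.split("\t", 1)[0] for line in handle if line.strip()}
    if digest not in known:
        with table.open("a") as handle:
            handle.write(f"{digest}\t{json.dumps(params, sort_keys=True)}\n")
    return digest


# Hash only the parameters that affect each step, so that changing
# param_b leaves the digest (and the outputs) of step 1 untouched.
step1 = param_hash(param_a=100)
step2 = param_hash(param_a=100, param_b=200)
```

In a Snakefile, the digests could then stand in for the long parameter strings, for instance as one directory level per step, while the table written by `record_params` maps each digest back to its parameter values.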

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Troy Comi