How to use a wildcard within expand function parameters in snakemake?
I have a json file like so:
{
  "foo": {
    "bar1": {
      "A1": {"name": "A1", "path": "/path/to/A1"},
      "B1": {"name": "B1", "path": "/path/to/B1"},
      "C1": {"name": "C1", "path": "/path/to/C1"},
      "D1": {"name": "D1", "path": "/path/to/D1"}
    },
    "bar2": {
      "A2": {"name": "A2", "path": "/path/to/A2"},
      "B2": {"name": "B2", "path": "/path/to/B2"},
      "C2": {"name": "C2", "path": "/path/to/C2"},
      "D2": {"name": "D2", "path": "/path/to/D2"}
    }
  }
}
I am trying to run my snakemake pipeline on the samples in sample sets 'bar1' and 'bar2' separately, putting the results into their own folders. When I expand my wildcards I don't want all iterations of sample sets and samples, I just want them in their specific groups, like this:
tmp/bar1/A1.bam
tmp/bar1/B1.bam
tmp/bar1/C1.bam
tmp/bar1/D1.bam
tmp/bar2/A2.bam
tmp/bar2/B2.bam
tmp/bar2/C2.bam
tmp/bar2/D2.bam
Hopefully my snakefile will help explain. I have tried having my snakefile like this:
sample_sets = [ i for i in config['foo'] ]
samples_dict = config['foo'] #cleans it up

def get_samples(wildcards):
    return list(samples_dict[wildcards.sample_set].keys())

rule all:
    input:
        expand(expand("tmp/{{sample_set}}/{sample}.bam", sample = get_samples), sample_set = sample_sets),
This doesn't work: my file names end up with "<function get_samples at 0x7f6e00544320>" in them! I have also tried:
rule all:
    input:
        expand(expand("tmp/{{sample_set}}/{sample}.bam", sample = list(samples_dict["{{sample_set}}"].keys())), sample_set = sample_sets),
but that gets a KeyError. I have also tried this:
rule all:
    input:
        [ ["tmp/{{sample_set}}/{sample}.aligned_bam.core.bam".format( sample = sample ) for sample in list(samples_dict[sample_set].keys())] for sample_set in sample_sets ]
which gets a "Wildcards in input files cannot be determined from output files: 'sample_set'" error.
I feel like there must be a simple way of doing this and perhaps I'm being a moron.
Any help would be very much appreciated! And let me know if I've missed some detail.
Solution 1:[1]
@SultanOrazbayev has the right of it, but just to throw in a couple of alternatives.
If you like the loops, the pythonic way to write it is with list comprehensions. If you have giant file lists you may notice an improvement in performance.
# d is assumed to be the parsed JSON, i.e. `config` in the question's Snakefile
list_files = [
    f"tmp/{key}/{nested_key}.bam"
    for key in d["foo"]
    for nested_key in d["foo"][key]
]
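As a quick check outside Snakemake, here is a minimal sketch of the same comprehension run against a dict shaped like the JSON in the question. The variable `d` and its contents are stand-ins I have filled in for illustration; in a real Snakefile this would be `config`.

```python
# Stand-in for the parsed JSON from the question (assumed to be what `d` refers to).
d = {
    "foo": {
        "bar1": {n: {"name": n, "path": f"/path/to/{n}"} for n in ("A1", "B1", "C1", "D1")},
        "bar2": {n: {"name": n, "path": f"/path/to/{n}"} for n in ("A2", "B2", "C2", "D2")},
    }
}

# Each sample is paired only with the sample set it belongs to,
# so no cross combinations such as tmp/bar1/A2.bam are produced.
list_files = [
    f"tmp/{sample_set}/{sample}.bam"
    for sample_set, samples in d["foo"].items()
    for sample in samples
]

for path in list_files:
    print(path)
```

The resulting list can be passed directly as the input of rule all, since every path is already a concrete file name with no wildcards left in it.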
The only way I can think of to use expand is basically constructing the same list. I pass it in as a dict to keep the wildcard names, though a tuple would be more efficient. The advantage of expand shows up if your file name pattern lives in a config variable and can't easily be formatted, if you want to keep meaningful wildcard names, or if you need allow_missing for other wildcards:
wcs = [
    {'sample_set': sample_set, 'sample': sample}
    for sample_set in d["foo"]
    for sample in d["foo"][sample_set]
]
list_files = expand(
    "tmp/{sample_set}/{sample}.bam", zip,
    sample_set=[wc['sample_set'] for wc in wcs],
    sample=[wc['sample'] for wc in wcs],
)
Sometimes the snakemake way isn't pythonic!
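To see what the zipped expand call is doing, here is a rough pure-Python sketch that fills the pattern once per wildcard dict. `expand_zipped` is a hypothetical helper written for this illustration, not a Snakemake function, and the trimmed-down `d` is made-up sample data in the shape of the question's JSON.

```python
# Trimmed-down stand-in for the parsed JSON (hypothetical sample data).
d = {
    "foo": {
        "bar1": {"A1": {}, "B1": {}},
        "bar2": {"A2": {}, "B2": {}},
    }
}

# One dict per (sample_set, sample) pair that actually exists in the data.
wcs = [
    {"sample_set": sample_set, "sample": sample}
    for sample_set in d["foo"]
    for sample in d["foo"][sample_set]
]


def expand_zipped(pattern, wildcard_dicts):
    """Hypothetical helper: format the pattern once per wildcard dict.

    This approximates the effect of passing zip to expand, where the
    wildcard lists are walked in lockstep instead of as a full product.
    """
    return [pattern.format(**wc) for wc in wildcard_dicts]


list_files = expand_zipped("tmp/{sample_set}/{sample}.bam", wcs)
print(list_files)
```

Because the pairs are built before formatting, the pattern string itself can come from anywhere, such as a config value, which is the case where going through expand rather than an f-string pays off.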
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Troy Comi |
