How to define parameters for a Snakemake rule with expand input

I have input files in this format:

dataset1/file1.bam
dataset1/file1_rmd.bam
dataset1/file2.bam
dataset1/file2_rmd.bam

I would like to run a command on each file; the command returns an integer for a given filename:

$ samtools view -c -F1 dataset1/file1.bam
200

I would like to run the command and merge the output for each file into the following csv file:

file1,200,100
file2,400,300

I can do this without using expand in the input, by appending to the output with the >> operator, but I would rather use > to avoid the possible file corruption that appending can lead to.
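A >>-based rule of that kind would look roughly like the sketch below (the rule name and the touch() marker output are only illustrative; the appended CSV is not a declared output, which is exactly why this pattern is fragile):

rule count_one:
    input:
        in1="{param1}/{param2}.bam",
        in2="{param1}/{param2}_rmd.bam"
    output: touch("{param1}/{param2}.counted")
    shell:
        """
        RCT=$(samtools view -c -F1 {input.in1})
        RCD=$(samtools view -c -F1 {input.in2})
        printf "{wildcards.param2},${{RCT}},${{RCD}}\n" >> {wildcards.param1}_merged.csv
        """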

I tried something like this, which did not work because of the wildcards.param2 part:

rule collect_rc_results:
    input: in1=expand("{param1}/{param2}.bam", param1=PARS1, param2=PARS2),
           in2=expand("{param1}/{param2}_rmd.bam", param1=PARS1, param2=PARS2)
    output: "{param1}_merged.csv"
    shell:
        """
        RCT=$(samtools view -c -F1  {input.in1})
        RCD=$(samtools view -c -F1  {input.in2})
        printf "{wildcards.param2},${{RCT}},${{RCD}}\n" > {output}
        """

I am aware that the input is no longer a single file but a list of files created by expand. Therefore I defined a function to work on a list input, but it is still not quite right:

import os

def get_read_count(infiles):
    return [os.popen("samtools view -c -F1 " + infile).read().strip() for infile in infiles]

rule collect_rc_results:
    input: in1=expand("{param1}/{param2}.bam", param1=PARS1, param2=PARS2),
           in2=expand("{param1}/{param2}_rmd.bam", param1=PARS1, param2=PARS2)
    output: "{param1}_merged.csv"
    params: rc1=get_read_count("{param1}/{param2}.bam"), rc2=get_read_count("{param1}/{param2}_rmd.bam")
    shell:
        """
        printf "{wildcards.param2},{params.rc1},{params.rc2}\n" > {output}
        """

What is the best practice for using wildcards inside the input file IDs when the input file list is defined with expand?

Edit: I can get the expected result with expand if I use an external bash script, such as this script.sh:

for INF in "${@}";do
        IN1=${INF}
        IN2=${IN1%.bam}_rmd.bam
        LIB=$(basename ${IN1%.*}|cut -d_ -f1)
        RCT=$(samtools view -c -F1 ${IN1} )
        RCD=$(samtools view -c -F1 ${IN2} )
        printf "${LIB},${RCT},${RCD}\n"
done

with

        params: script="script.sh"
        shell:
                """
                bash {params.script} {input} > {output} 
                """

but I am interested in learning whether there is an easier way to get the same output using only Snakemake.

Edit2:

I can also do it in the shell directive instead of a separate script:

rule collect_rc_results:
        input:
            in1=expand("{param1}/{param2}.bam", param1=PARS1, param2=PARS2),
            in2=expand("{param1}/{param2}_rmd.bam", param1=PARS1, param2=PARS2)
        output: "{param1}_merged.csv"
        shell:
            """
            for INF in {input.in1};do
                IN1=${{INF}}
                IN2=${{IN1%.bam}}_rmd.bam
                LIB=$(basename ${{IN1%.*}}|cut -d_ -f1)
                RCT=$(samtools view -c -F1 ${{IN1}} )
                RCD=$(samtools view -c -F1 ${{IN2}} )
                printf "${{LIB}},${{RCT}},${{RCD}}\n"
            done > {output}
            """

This produces the expected file. However, I would be interested to hear if anyone has a more elegant or "best practice" solution.



Solution 1:[1]

I don't think there is anything really wrong with your current solution, but I would be more inclined to use a run directive with shell functions to perform your loop.

@bli's suggestion to use temp files is also good, especially if the intermediate step (samtools in this case) is long running; you can get tremendous wall-clock gains from parallelizing those computations. The downside is you will be creating lots of tiny files.
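If you went that route, the pattern might look roughly like this (untested sketch; the rule names are made up, param1 stays a wildcard, and PARS2 holds the file stems):

rule count_reads:
    input:
        in1="{param1}/{param2}.bam",
        in2="{param1}/{param2}_rmd.bam"
    output: temp("{param1}/{param2}.counts")
    shell:
        """
        RCT=$(samtools view -c -F1 {input.in1})
        RCD=$(samtools view -c -F1 {input.in2})
        printf "{wildcards.param2},${{RCT}},${{RCD}}\n" > {output}
        """

rule merge_counts:
    input: expand("{param1}/{param2}.counts", param2=PARS2, allow_missing=True)
    output: "{param1}_merged.csv"
    shell: "cat {input} > {output}"

Each count then runs as its own job and can be parallelized, at the cost of one tiny .counts file per stem (removed automatically once the merge finishes, because of temp()).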

I noticed that your inputs are fully qualified through expand, but based on your example I think you want to leave param1 as a wildcard. Assuming PARS2 is a list, it should be safe to zip in1, in2 and PARS2 together. Here's my take (written but not tested).

rule collect_rc_results:
    input: 
        in1=expand("{param1}/{param2}.bam", param2=PARS2, allow_missing=True),
        in2=expand("{param1}/{param2}_rmd.bam", param2=PARS2, allow_missing=True)
    output: "{param1}_merged.csv"
    run:
        with open(output[0], 'w') as outfile:
            for infile1, infile2, parameter in zip(input.in1, input.in2, PARS2):
                # Not tested: shell(..., read=True) should capture stdout here,
                # but you may have to decode/strip a trailing newline from it.
                RCT = shell(f'samtools view -c -F1 {infile1}', read=True)
                RCD = shell(f'samtools view -c -F1 {infile2}', read=True)
                outfile.write(f'{parameter},{RCT},{RCD}\n')
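If capturing the count through shell() turns out to be fiddly, a plain subprocess call inside the run block does the same job; bam_count below is a hypothetical helper, not part of the Snakemake API:

import subprocess

def bam_count(path):
    # run samtools and return its stdout (the read count) as a stripped string
    return subprocess.check_output(
        ["samtools", "view", "-c", "-F1", path], text=True
    ).strip()

Then RCT = bam_count(infile1) and RCD = bam_count(infile2) drop straight into the same outfile.write() line as above.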

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Troy Comi