Link rule_1 output targeting files with rule_2 input targeting the files' folder [snakemake]

I am trying to create a workflow in snakemake with two rules:

  • pool_files, which takes a list of genomes saved in different folders and creates a copy of each genome in a single folder
  • run_pairwise (named calculate_ANI in the code below), which takes the path of the folder containing the genome copies, runs a function (in my case an ANI calculation, but that is not relevant) and saves all the results in an output folder

My issue is that the input and output of the first rule, pool_files, are single files, while the input and output of the second rule, run_pairwise, are folders. My workaround is to pass both the copied files from pool_files and the output folder of run_pairwise as inputs to rule all. However, even in the best case, I get an error like:

ChildIOException: File/directory is a child to another output

The table (object gnm_table in the example below) that I read in and that contains the path of all genomes looks like this:

                  dir          file
0  _input/genomes/ref   aaa_v1.0.fa
1      _input/genomes        bbb.fa
2      _input/genomes        ccc.fa
3      _input/genomes        ddd.fa

The tentative code that I have come up with so far looks like this:

import os

rule all:
    input:
        expand("_results/pool_gnms/{target}", target=gnm_table.file),
        "_results/ANI"


rule pool_files:
    input:
        # Look up the source folder of the requested genome; .iloc[0] returns
        # the bare value, whereas .to_string() would also include the row index.
        i_gnm = lambda wildcards: os.path.join(gnm_table.dir[gnm_table.file == wildcards.target].iloc[0], wildcards.target)
    output:
        gnm_link = "_results/pool_gnms/{target}",
    shell:
        'ln -s '
        '{input.i_gnm} '
        '{output.gnm_link}'


rule calculate_ANI:
    input:
        pool_dir = "_results/pool_gnms",
    output:
        ANI_dir = directory("_results/ANI")
    shell:
        'average_nucleotide_identity.py '
        '-o {output.ANI_dir} '
        '-i {input.pool_dir}'

What strategy should I follow to accomplish this task? Maybe I should use a checkpoint? Many thanks for any input!



Solution 1:[1]

There is no need for checkpoints. Checkpoints are needed when you cannot know in advance which files a rule will create (e.g. you don't know how many clusters an algorithm will find). In your case, everything you need is already in gnm_table. You can define a rule that takes as its input all the files that pool_files has to copy. The output of that rule can be a flag file, and that flag can then serve as the input of rule calculate_ANI.
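
A minimal sketch of this idea might look like the following. It is not tested against your data and rests on a few assumptions: the table is read from a hypothetical tab-separated file "_input/genomes.tsv", and the rule name pool_done, the flag path "_results/pool_gnms.done" and the SRC dictionary are made up for illustration. The pooled folder is passed to calculate_ANI through params, so no rule declares the folder itself as an input or output.

```snakemake
import os
import pandas as pd

# Assumption: gnm_table has the columns "dir" and "file" shown in the question.
# The file path is hypothetical; read the table however you do it today.
gnm_table = pd.read_csv("_input/genomes.tsv", sep="\t")

# Map each pooled file name to the absolute path of its original file.
# An absolute target keeps the symlink valid from inside the pool folder.
SRC = {
    f: os.path.abspath(os.path.join(d, f))
    for d, f in zip(gnm_table["dir"], gnm_table["file"])
}


rule all:
    input:
        "_results/ANI"


rule pool_files:
    input:
        i_gnm = lambda wildcards: SRC[wildcards.target]
    output:
        gnm_link = "_results/pool_gnms/{target}"
    shell:
        "ln -s {input.i_gnm} {output.gnm_link}"


# Hypothetical helper rule: it claims every pooled file as input and
# only writes an empty flag file once all of them exist.
rule pool_done:
    input:
        expand("_results/pool_gnms/{target}", target=list(SRC))
    output:
        touch("_results/pool_gnms.done")


rule calculate_ANI:
    input:
        flag = "_results/pool_gnms.done"
    params:
        # The folder is a parameter, not an input, so it never clashes
        # with the per-file outputs of pool_files.
        pool_dir = "_results/pool_gnms"
    output:
        ANI_dir = directory("_results/ANI")
    shell:
        "average_nucleotide_identity.py "
        "-o {output.ANI_dir} "
        "-i {params.pool_dir}"
```

With this layout, rule all only needs to request "_results/ANI": the flag pulls in every pooled file, so the files themselves no longer have to be listed there. The flag is kept outside "_results/pool_gnms" so that the ANI script does not find an extra file in its input folder.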

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Dmitry Kuzminov