Using Snakemake to rename files according to a defined mapping
I'm trying to use Snakemake to download a list of files and then rename them according to a mapping given in a file. I first read a dictionary of the form {ID_for_download: sample_name} from a file and pass the list of its keys to the first rule for download (because downloading is taxing, I'm just using a dummy script to generate empty files). For every ID in the list, two files are downloaded, named {file}_1.fastq and {file}_2.fastq. Once those files are downloaded, I rename them in the second rule, taking advantage of the run keyword, which lets a rule execute Python code. A dry run with the -n flag works. But when I do an actual run, I get an error of the form
Job Missing files after 5 seconds [list of files]
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Job id: 0 completed successfully, but some output files are missing. 0
Exiting because a job execution failed. Look above for error message
Removing output files of failed job rename_srafiles_to_samples since they might be corrupted: [list of all files]
What happens is that a directory to store my files is created, my files are "downloaded", and then they are renamed. When the last file is reached, I get this error and everything is deleted. The Snakemake file is below:
import csv
import os

SRA_MAPPING = read_dictionary()  # dictionary read from a file
SRAFILES = list(SRA_MAPPING.keys())[1:]  # list of sra files
SAMPLES = [SRA_MAPPING[key] for key in SRAFILES]  # list of sample names

rule all:
    input:
        expand("raw_samples/{samples}_1.fastq", samples=SAMPLES),
        expand("raw_samples/{samples}_2.fastq", samples=SAMPLES),

rule download_srafiles:
    output:
        expand("raw_samples/{srafiles}_1.fastq", srafiles=SRAFILES),
        expand("raw_samples/{srafiles}_2.fastq", srafiles=SRAFILES)
    shell:
        "bash dummy_download.sh"

rule rename_srafiles_to_samples:
    input:
        expand("raw_samples/{srafiles}_1.fastq", srafiles=SRAFILES),
        expand("raw_samples/{srafiles}_2.fastq", srafiles=SRAFILES)
    output:
        expand("raw_samples/{samples}_1.fastq", samples=SAMPLES),
        expand("raw_samples/{samples}_2.fastq", samples=SAMPLES)
    run:
        os.chdir(os.getcwd() + r"/raw_samples")
        for file in os.listdir():
            old_name = file[:file.find("_")]
            sample_name = SRA_MAPPING[old_name]
            new_name = file.replace(old_name, sample_name)
            os.rename(file, new_name)
I tried running download_srafiles on its own and it worked. I also tried running rename_srafiles_to_samples on its own and it worked. But when I run the two rules together, I get the error. For completeness, the script dummy_download.sh is below:
#!/bin/bash
read -a samples <<< $(cut -d , -f 1 linker.csv | tail -n +2)
for file in "${samples[@]}"
do
    touch raw_samples/${file}_1.fastq
    touch raw_samples/${file}_2.fastq
done
(linker.csv is a file that has ID_for_download in one column and sample_name in the other.)
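The Snakefile calls read_dictionary() without showing it. The following is only a minimal sketch of what such a helper might look like. It assumes linker.csv is comma-separated with one header row, the download ID in the first column and the sample name in the second, which matches the cut -d , -f 1 ... | tail -n +2 call in dummy_download.sh. The signature and the default path are assumptions, not the original code.

```python
import csv


def read_dictionary(path="linker.csv"):
    """Return {ID_for_download: sample_name} from a two-column CSV file.

    Assumes a comma-separated file with a single header row (hypothetical layout).
    """
    mapping = {}
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row
        for row in reader:
            if len(row) < 2:
                continue  # ignore blank or malformed lines
            mapping[row[0].strip()] = row[1].strip()
    return mapping
```

Because this variant skips the header while reading, the [1:] slice on SRA_MAPPING.keys() (presumably there to drop a header entry) would remove a real ID and would have to be dropped, for example SRAFILES = list(SRA_MAPPING).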
What am I doing wrong?
EDIT: Per user dariober, changing directories via Python's os module in the rule rename_srafiles_to_samples "confused" Snakemake. Snakemake's logic is sound: the output paths are relative, so once I change into raw_samples, it looks for raw_samples inside raw_samples and fails. To that end, I tested different versions.
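The effect described above can be reproduced without Snakemake, because a relative path is always resolved against the current working directory. The sketch below uses a throwaway temporary directory and a made-up file name (sampleA_1.fastq); it only illustrates the path lookup and is not how Snakemake checks outputs internally.

```python
import os
import tempfile

start_dir = os.getcwd()
workdir = tempfile.mkdtemp()  # throwaway directory standing in for the project folder
os.chdir(workdir)

os.mkdir("raw_samples")
open(os.path.join("raw_samples", "sampleA_1.fastq"), "w").close()

declared_output = "raw_samples/sampleA_1.fastq"  # relative path, as written in a rule

visible_before = os.path.exists(declared_output)  # resolved against workdir

os.chdir("raw_samples")
visible_inside = os.path.exists(declared_output)  # now means raw_samples/raw_samples/...

os.chdir(workdir)
visible_after = os.path.exists(declared_output)  # back in workdir, found again

os.chdir(start_dir)  # leave the interpreter where it started
print(visible_before, visible_inside, visible_after)  # prints: True False True
```

The file never moves; only the directory that the relative path is resolved against changes.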
Version 1
Exactly as dariober explained. Important bits of code:
for file in os.listdir('raw_samples'):
    old_name = file[:file.find("_")]
    sample_name = SRA_MAPPING[old_name]
    new_name = file.replace(old_name, sample_name)
    os.rename('raw_samples/' + file, 'raw_samples/' + new_name)
It lists the files in the raw_samples directory and then renames them. The crucial step is to add the directory prefix (raw_samples/) to both paths in each rename, so the working directory never changes.
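The same idea can be written as a small helper that builds both paths with os.path.join and loops over the mapping rather than over os.listdir. This is a sketch under assumptions: the function name rename_by_mapping and the suffixes parameter are hypothetical, and it expects every mapped file to exist (os.rename raises FileNotFoundError otherwise).

```python
import os


def rename_by_mapping(directory, mapping, suffixes=("_1.fastq", "_2.fastq")):
    """Rename <old><suffix> to <new><suffix> inside `directory`.

    Returns the list of new paths; the working directory is never changed.
    """
    renamed = []
    for old_name, new_name in mapping.items():
        for suffix in suffixes:
            source = os.path.join(directory, old_name + suffix)
            target = os.path.join(directory, new_name + suffix)
            os.rename(source, target)  # raises FileNotFoundError if source is missing
            renamed.append(target)
    return renamed
```

Inside the run block this could be called as rename_by_mapping("raw_samples", {key: SRA_MAPPING[key] for key in SRAFILES}). Driving the loop from the mapping means an unrelated file in raw_samples does not trigger a KeyError, which the os.listdir loop would raise on any name missing from SRA_MAPPING.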
Version 2
The same as my original post, but instead of staying inside raw_samples, I change back to the original working directory once the loop has finished. It works.
os.chdir(os.getcwd() + r"/raw_samples")
for file in os.listdir():
    old_name = file[:file.find("_")]
    sample_name = SRA_MAPPING[old_name]
    new_name = file.replace(old_name, sample_name)
    os.rename(file, new_name)
os.chdir("..")
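One weakness of the manual os.chdir("..") is that it is skipped if the loop raises, for example on a KeyError. A possible way around this, sketched here with a hypothetical helper named working_directory, is a context manager that always restores the previous directory.

```python
import os
from contextlib import contextmanager


@contextmanager
def working_directory(path):
    """Temporarily change the working directory, restoring it on exit."""
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(previous)  # runs even if the body raises
```

In the run block, the renaming loop would then sit inside with working_directory("raw_samples"):. Python 3.11 and later ship an equivalent, contextlib.chdir. The working directory is shared by the whole process, so avoiding chdir altogether, as in Version 1, remains the more conservative choice.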
Version 3
The same as my original post, but instead of changing anything in the run block, I changed the output paths to drop the raw_samples/ directory prefix. This meant that I had to change rule all as well. It didn't work. The code is below:
rule all:
    input:
        expand("{samples}_1.fastq", samples=SAMPLES),
        expand("{samples}_2.fastq", samples=SAMPLES),

rule download_srafiles:
    output:
        expand("raw_samples/{srafiles}_1.fastq", srafiles=SRAFILES),
        expand("raw_samples/{srafiles}_2.fastq", srafiles=SRAFILES)
    shell:
        "touch {output}"

rule rename_srafiles_to_samples:
    input:
        expand("raw_samples/{srafiles}_1.fastq", srafiles=SRAFILES),
        expand("raw_samples/{srafiles}_2.fastq", srafiles=SRAFILES)
    output:
        expand("{samples}_1.fastq", samples=SAMPLES),
        expand("{samples}_2.fastq", samples=SAMPLES)
    run:
        os.chdir(os.getcwd() + r"/raw_samples")
        for file in os.listdir():
            old_name = file[:file.find("_")]
            sample_name = SRA_MAPPING[old_name]
            new_name = file.replace(old_name, sample_name)
            os.rename(file, new_name)
The error it gives is:
MissingOutputException in line 24
...
Job files missing
The files are actually there, so I don't know whether I made an error in the code or whether this is a bug.
Conclusion
I wouldn't say that this is a problem with Snakemake. It's more a problem with my poorly thought-out process. In retrospect, it makes perfect sense that entering a directory interferes with how Snakemake resolves input and output paths. If I want to use the os module in Snakemake to change directories, I have to be very careful: enter wherever I need to, but always return to the original working directory. Many thanks to /u/dariober and /u/SultanOrazbayev.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
