'Is it possible to use wildcard_constraints to exclude certain keywords from being matched Snakemake?
I have a rule that calculates new variables based on a set of variables, these variables are separated into different files. I have another rule for calculating the averages of all the different variables in the whole database. My problem is that snakemake tries to find my derived variables in the original database, which of course are not there.
Is there a way to have constrained the averaging rule such that it will calculate the average for all variables except for a list of the variables that are derived
Psudo code of how the rule look like
rule calc_average:
input:
pi_clim_var = lambda w: get_control_path(w, w.variable),
output:
outpath = outdir+'{experiment}/{variable}/{variable}_{experiment}_{model}_{freq}.nc'
log:
"logs/calc_average/{variable}_{model}_{experiment}_{freq}.log"
wildcard_constraints:
variable= '!calculated1!calculated2' # "orrvar1|orrvar2"....
notebook:
"../notebooks/calc_clim.py.ipynb"
I can of make a list of all the variables that I would like to have in the database and then do:
wildcard_constraints:
variable="|".join(list_of_vars)
But I was wondering if it is possible to do it the other way round? E.g:
wildcard_constraints:
variable="!".join(negate_list_of_vars) # don't match these wildcards
EDIT:
The get_control_path(w, w.variable)
constructs the path to the input file based on a lookup table that uses the wildcards as a keys.
def get_control_path(w, variable, grid_label=None):
if grid_label == None:
grid_label = config['default_grid_label']
try:
paths = get_paths(w, variable,'piClim-control', grid_label,activity='RFMIP', control=True)
except KeyError:
paths = get_paths(w, variable,'piClim-control', grid_label,activity='AerChemMIP', control=True)
return paths
def get_paths(w, variable,experiment, grid_label=None, activity=None, control=False):
"""
Get CMIP6 model paths in database based on the lookup tables.
Parameters:
-----------
w : snake.wildcards
a named tuple that contains the snakemake wildcards
"""
if w.model in ["NorESM2-LM", "NorESM2-MM"]:
root_path = f'{ROOT_PATH_NORESM}/{CMIP_VER}'
look_fnames = LOOK_FNAMES_NORESM
else:
root_path = f'{ROOT_PATH}/{CMIP_VER}'
look_fnames = LOOK_FNAMES
if activity:
activity= activity
else:
activity = LOOK_EXP[experiment]
model = w.model
if control:
variant=config['model_specific_variant']['control'].get(model, config['variant_default'])
else:
variant = config['model_specific_variant']['experiment'].get(model, config['variant_default'])
table_id = TABLE_IDS.get(variable,DEFAULT_TABLE_ID)
institution = LOOK_INSTITU[model]
try:
file_endings = look_fnames[activity][model][experiment][variant][table_id]['fn']
except:
raise KeyError(f"File ending is not defined for this combination of {activity}, {model}, {experiment}, {variant} and {table_id} " +
"please update config/lookup_file_endings.yaml accordingly")
if grid_label == None:
grid_label = look_fnames[activity][model][experiment][variant][table_id]['gl'][0]
check_path = f'{root_path}/{activity}/{institution}/{model}/{experiment}/{variant}/{table_id}/{variable}/{grid_label}'
if os.path.exists(check_path)==False:
grid_labels = ['gr','gn', 'gl','grz', 'gr1']
i = 0
while os.path.exists(check_path)==False and i < len(grid_labels):
grid_label = grid_labels[i]
check_path = f'{root_path}/{activity}/{institution}/{model}/{experiment}/{variant}/{table_id}/{variable}/{grid_label}'
i += 1
if control:
version = config['version']['version_control'].get(w.model, 'latest')
else:
version = config['version']['version_exp'].get(w.model, 'latest')
fname = f'{variable}_{table_id}_{model}_{experiment}_{variant}_{grid_label}'
paths = expand(
f'{root_path}/{activity}/{institution}/{model}/{experiment}/{variant}/{table_id}/{variable}/{grid_label}/{version}/{fname}_{{file_endings}}'
,file_endings=file_endings)
# Sometimes the verisons are just messed up... try one more time with latest
if not os.path.exists(paths[0]):
paths=expand(
f'{root_path}/{activity}/{institution}/{model}/{experiment}/{variant}/{table_id}/{variable}/{grid_label}/latest/{fname}_{{file_endings}}'
,file_endings=file_endings)
# Sometimes the file ending are different depending on varialbe
if not os.path.exists(paths[0]) and len(paths) >= 2:
paths = [paths[1]]
return paths
Solution 1:[1]
Your description is a little too abstract for me to fully grasp your intent. You may be able to use regex's to solve this, but depending on the number of variables to consider, it could be a very slow calculation to match. Here are some other ideas, if they don't seem right please update your question with a little more context (the rule requesting calc_average and the get_control_path function).
- Place derived and original files in different subdirectories. Then you can restrict the average rule to just be the original files.
- Incorporate the logic into an input function/expand. Say the requesting rule is doing something like
rule average:
input: expand('/path/to/{input}', input=[input for input in inputs if input not in negate_list_of_vars])
output: 'path/to/average'
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Troy Comi |