'How to read csv files with specific pattern from nested file directories in pandas?

I intend to read csv file with specific pattern from nested file directories where each subdirectories has multiple csv files and I only want to read the one end with specific pattern. I already figured out the way of doing this in R, but wanted to do this in pandas. I found couple of useful post but can't able to read the file that I wanted to read in pandas.

current attempt

here is file structure that I have and wanted to read file start with Z_19_xx.csv. for instance:

import pandas as pd

dir1="demo2020/p1 pop/csv/Z_19_master.csv"
f1=pd.read_csv(dir1)

this is hard coded and I want to avoid of doing this. Below is the file structure:

demo2020
    - p1 pop
        -csv
            - A_17_master.csv
            - A_18_master.csv
            - B_18_master.csv
            - C_19_master.csv
            - Z_19_master.csv
    - p2 cop
        -csv
            - A_17_cop.csv
            - A_18_cop.csv
            - B_18_cop.csv
            - C_19_cop.csv
            - Z_19_cop.csv
    - p3 res
        -csv
            - A_17_res.csv
            - A_18_res.csv
            - B_18_res.csv
            - C_19_res.csv
            - Z_19_res.csv
    - p4 nac
        -csv
            - A_17_nac.csv
            - A_18_nac.csv
            - B_18_nac.csv
            - C_19_nac.csv
            - Z_19_nac.csv

my current attempt in R:

here is my R code to do this in handy:

yr=19
dir="demo2020/"
files <-c(f1  = paste0("p1 pop/csv/Z_", yr, "_master.csv") , 
                    f2 = paste0('p2 cop/csv/Z_', yr,'_cop.csv') ,
                    f3 = paste0('p3 res/csv/Z_', yr,'_res.csv') , 
                    f4  = paste0('p4 nac/csv/Z_', yr,'_nac.csv') 
)

path=(paste0(dir,files))
> path
[1] "demo2020/p1 pop/csv/Z_19_master.csv"
[2] "demo2020/p2 cop/csv/Z_19_cop.csv"   
[3] "demo2020/p3 res/csv/Z_19_res.csv"   
[4] "demo2020/p4 nac/csv/Z_19_nac.csv"

# read them

for(i in 1:length(files))
{
    f <- assign(names(files[i]), read.csv(paste0(dir, files[i]),stringsAsFactors = FALSE,skip = 1))
}

python objective - pandas

I want to do this in python without hard coding, and simply want to use above R code logic in python and use pandas to read csv files. So far, here is my attempt:

import pandas
import os

parent_dir = 'demo2020/'
subject_dirs = [os.path.join(parent_dir, dir) for dir in os.listdir(parent_dir) if os.path.isdir(os.path.join(parent_dir, dir))]

filelist = []
for dir in subject_dirs:
    csv_files = [os.path.join(dir, csv) for csv in os.listdir(dir) if os.path.isfile(os.path.join(dir, csv)) and and csv.startswith('Z_') and csv.endswith('.csv')]
    for file in csv_files:
        df=pd.read_csv(file)
        filelist.append(df)

but still not getting this right, I only want to read Z_19_xx.csv from each subfolder and concatenate them. How can we do this nicely in python? Can anyone point me out of making this right in python? Any idea?



Solution 1:[1]

You can use Glob() function to find files recursively in Python. With glob, we can also use wildcards ("*, ?, [ranges]) apart from exact string search to make path retrieval more simple and convenient.

If you want to match the files: Z_19_xx.csv instead of Z_xx.csv, you can use the following:

import glob

csv_files = glob.glob('demo2020/p*/csv/Z_19_*.csv')

filelist = []
for file in csv_files:
    df = pd.read_csv(file)
    filelist.append(df)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Jane Luo