'Iterate through directory and return DataFrame with number of lines per file

I have a directory containing several excel files. I want to create a DataFrame with a list of the filenames, a count of the number of rows in each file, and a min and max column.

Example file 1:

Example File 1

Example file 2:

Example file 2

Desired result:

Desired Result

This is as far as I've gotten:

fileslist = os.listdir(folder)
for file in fileslist:
     str = file
     if not str.startswith('~$'):
         df = pd.read_excel(os.path.join(folder,file), header = 0, sheet_name = 'Main', usecols=['Name','Number'])
         NumMax = max(df['Number'])
         NumMin = min(df['Number'])
         NameCount = df['Name'].count()

From here, I can't figure out how to create the final DataFrame as shown in the above "Desired Result." I'm very new at this and would appreciate any nudge in the right direction.



Solution 1:[1]

First of all, I would just like to point out that you shouldn't name any variable as "str" as you did here:

str = file

This can cause issues in the future if you ever try to convert some object to a string using the str(object) as you are redefining the method. Also, this redefinition of "file" is unnecessary, so you can just take that out. You did something similar with "file" as that is also a keyword that you are redefining. A name like "file_name" would be better.

As for how to create the final dataframe, it is somewhat simple. I would recommend you use a list and dictionaries and add all the data to that, then create the dataframe. Like this:

fileslist = os.listdir(folder)
# temporary list to store data
data = []
for file_name in fileslist:
     if not file_name.startswith('~$'):
         df = pd.read_excel(os.path.join(folder,file_name), header = 0, sheet_name = 'Main', usecols=['Name','Number'])
         NumMax = max(df['Number'])
         NumMin = min(df['Number'])
         NameCount = df['Name'].()
         # appending row of data with appropriate column names
         data.append({'Filename': file_name, 'Count': NameCount, 'MinNumber': NumMin, 'MaxNumber': NumMax})
# creating actual dataframe
df = pd.DataFrame(data)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 jbyler57