How to efficiently append 200k+ JSON files to a DataFrame
I have over 200,000 JSON files stored across 600 directories on my drive, with a copy in Google Cloud Storage.
I need to merge them and convert the result to a CSV file, so I thought of looping the following logic:
- open each file with pandas read_json
- apply some transformations (drop some columns, add columns with values based on the filename, change the order of columns)
- append this df to a master_df
and finally export the master_df to a CSV.
Each file is around 5,000 rows when converted to a DataFrame, which would result in a DataFrame of around 300,000,000 rows. The files are grouped into 3 top-level directories (and then, depending on the file, into a couple more levels, so there are 3 dirs at the top, but about 600 overall). To speed things up a bit, I decided to run the script per top-level directory.
I am running this script on my local machine (32 GB RAM, sufficient disk space on an SSD), yet it's painfully slow, and it keeps getting slower: at the beginning one df was appended within 1 second; after 4,000 loop iterations the time to append had grown to 2.7 s.
I am looking for a way to speed this up to something that could reasonably finish within (hopefully) a couple of hours.
So, bottom line: should I try to optimize my code and run it locally, or not even bother, keep the script as is, and run it in e.g. Google Cloud Run?
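For context, the slowdown you describe is characteristic of calling `DataFrame.append` in a loop: each call copies the entire accumulated frame, so total cost grows quadratically with the number of files. The usual fix is to collect the per-file frames in a plain Python list and call `pd.concat` once at the end. A minimal sketch, with toy in-memory frames standing in for the real JSON files:

```python
import pandas as pd

# Toy stand-ins for the per-file DataFrames; in the real script each one
# would come from pd.read_json plus the per-file transformations.
frames = []
for i in range(3):
    frames.append(pd.DataFrame({"new_name": [f"kw{i}"], "D": [i], "E": [i * 2]}))

# One concat at the end instead of one append per file.
master_df = pd.concat(frames, ignore_index=True)
print(len(master_df))  # one row per toy file, i.e. 3
```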
The JSON files contain the keys A, B, C, D, E. I drop A and B as not important, and rename C.
The code I have so far:
import pandas as pd
import os

def listfiles(track, footprint):
    """
    This function lists files in a specified filepath (track) by a footprint found in the filename with extension.
    Function returns a list of files.
    """
    locations = []
    for r, d, f in os.walk(track):
        for file in f:
            if footprint in file:
                locations.append(os.path.join(r, file))
    return locations
def create_master_df(track, footprint):
    master_df = pd.DataFrame(columns=['date', 'domain', 'property', 'lang', 'kw', 'clicks', 'impressions'])
    all_json_files = listfiles(track, footprint)
    prop = []  # this indicates the starting directory, and to which output file master_df should be saved
    for i, file in enumerate(all_json_files):
        # here starts the logic by which I identify some properties of the file,
        # to use them as values in columns in local_df
        elements = file.split('\\')
        elements[-1] = elements[-1].replace('.json', '')
        elements[-1] = elements[-1].replace('-Copy-Multiplicated', '')
        date = elements[-1][-10:]
        domain = elements[7]
        property = elements[6]
        language = elements[8]

        json_to_rows = pd.read_json(file)
        local_df = pd.DataFrame(json_to_rows.rows.values.tolist())
        local_df = local_df.drop(columns=['A', 'B'])
        local_df = local_df.rename(columns={'C': 'new_name'})

        # here I add the earlier extracted data, to be able to distinguish rows in the master DF
        local_df['date'] = date
        local_df['domain'] = domain
        local_df['property'] = property
        local_df['lang'] = language
        local_df = local_df[['date', 'domain', 'property', 'lang', 'new_name', 'D', 'E']]

        master_df = master_df.append(local_df, ignore_index=True)
        prop = property
        print(f'appended {i} times')

    out_file_path = f'{track}\\{prop}-outfile.csv'
    master_df.to_csv(out_file_path, sep=";", index=False)
# run the script on the first directory
track = 'C:\\xyz\\a'
footprint = '.json'
create_master_df(track, footprint)

# run the script on the second directory
track = 'C:\\xyz\\b'
create_master_df(track, footprint)
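Since the target is a CSV anyway, another option worth considering (an assumption on my side, not something from the question) is to skip the master_df entirely and stream each transformed local_df straight to the output file in append mode, writing the header only once. That caps memory use at one file's worth of rows regardless of the total row count:

```python
import pandas as pd

# Hypothetical sketch: two toy chunks stand in for the transformed local_df
# produced for each JSON file in the real loop.
chunks = [
    pd.DataFrame({"new_name": ["x"], "D": [1], "E": [2]}),
    pd.DataFrame({"new_name": ["y"], "D": [3], "E": [4]}),
]

out_file_path = "outfile.csv"
for i, local_df in enumerate(chunks):
    # First chunk creates the file and writes the header;
    # later chunks append rows without repeating the header.
    local_df.to_csv(out_file_path, sep=";", index=False,
                    mode="w" if i == 0 else "a", header=(i == 0))
```

This also means a crash partway through only loses the current file, not the whole accumulated frame.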
Any feedback is welcome!
# Edit:
I tried and timed a couple of ways of approaching it; below is an explanation of what I did:
- clear execution - at first I ran my code as it was written
- 1st try - I moved all changes to local_df to just before the final master_df was saved to CSV. This was an attempt to see whether working on the files 'as they are', without manipulating them, would be faster. I removed dropping columns, renaming columns, and reordering them from the for loop; all of this was applied to master_df before exporting to CSV
- 2nd try - I brought dropping unnecessary columns back into the for loop, as performance was clearly impacted by the additional columns
- 3rd try - so far the most efficient: instead of appending local_df to master_df, I transformed it to a local list and appended that to a master_list. The master_list was then concatenated with pd.concat into master_df and exported to CSV
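The pattern from the 3rd try can be sketched roughly like this (toy frames again standing in for the real per-file data; whether you store raw row lists or whole frames in the master list, the key point is building the final frame in a single step):

```python
import pandas as pd

master_list = []
for i in range(3):  # stand-in for the per-file loop
    local_df = pd.DataFrame({"new_name": [f"kw{i}"], "D": [i], "E": [i]})
    # store plain row dicts instead of growing a DataFrame each iteration
    master_list.extend(local_df.to_dict("records"))

# build the final frame once, from all collected rows
master_df = pd.DataFrame(master_list)
print(master_df.shape)  # (3, 3)
```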
Here is a comparison of speed (how fast the script iterated the for loop in each version, tested on ~800 JSON files): [image: comparison of timed script executions]
It was surprising (and super nice!) to see that I actually wrote solid code in the first place, as I don't have much experience coding. :)
Overall, when the job was run on the test set of files, it finished as follows (in seconds):
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow