Fastest way to match data from two massive lists with differing data types?
I have data describing a directory structure of unknown (and potentially massive) size, and data describing the same structure from Perforce. Using Python, I need to match the local data against the Perforce data and generate a list of files that reflects everything in the user's workspace (local directory), including files that are missing from Perforce, as well as everything in the depot that is missing from the workspace.
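The two "missing" lists described above amount to set differences over file paths. A minimal sketch of that idea, assuming both sides can be reduced to comparable path strings: `partition_paths` is a hypothetical helper, the file names are made up for illustration, and `ntpath.normcase` is used on the assumption that the workspace is on Windows, where paths compare case-insensitively.

```python
import ntpath


def partition_paths(local_paths, p4_client_paths):
    """Split paths into (local only, depot only, present in both)."""
    # Normalize case and separators so equal Windows paths compare equal.
    local = {ntpath.normcase(p) for p in local_paths}
    depot = {ntpath.normcase(p) for p in p4_client_paths}
    return local - depot, depot - local, local & depot


# Illustrative input only: these file names are hypothetical.
local_only, depot_only, both = partition_paths(
    [r"X:\Path\To\Data\new_file.extension", r"X:\Path\To\Data\file.extension"],
    [r"X:\Path\To\Data\file.extension", r"X:\Path\To\Data\unsynced.extension"],
)
```

The returned paths are normalized (lowercased), so a mapping from normalized path back to the original spelling would be needed if the original case matters.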
Local Directory Structure Data:
- I have full control over how I gather this data (currently using os.walk)
Perforce Data:
- Not much control over how the data is returned
- Currently comes as a list of dictionaries
- Data returns very fast regardless of size.
# This list has hundreds of thousands of entries.
p4data_example = [{'depotFile': '//Path/To/Data/file.extension', 'clientFile': 'X:\\Path\\To\\Data\\file.extension', 'isMapped': '', 'headAction': 'add', 'headType': 'text', 'headTime': '00000', 'headRev': '1', 'headChange': '0000', 'headModTime': '00000', 'haveRev': '', 'otherOpen': ['stuff'], 'otherAction': ['move/delete'], 'otherChange': ['00000'], 'otherOpens': '1'}]
I need to operate on the local directory files whether or not they have matching p4 data.
import os

path_to_data = r"X:\Path\To\Data"
# p4 is assumed to be an already-connected Perforce client object
p4data = p4.run('fstat', r"%s\..." % path_to_data)

for root, dirs, files in os.walk(path_to_data, topdown=False):
    for file in files:
        file_name = os.path.join(root, file)
        # Linear scan over every p4 record for each local file
        matchingp4 = None
        for p4item in p4data:
            if p4item['clientFile'] == file_name:
                matchingp4 = p4item
                break
        do_stuff_with_data(file_name, matchingp4)
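One way the inner loop over the Perforce records could be avoided, sketched under assumptions: index the records once in a dictionary keyed by a normalized `clientFile`, so that each local file costs a single lookup instead of a scan. The helper names `build_p4_index` and `match_local_files` are hypothetical, and `ntpath.normcase` again assumes a Windows workspace with case-insensitive paths.

```python
import ntpath
import os


def build_p4_index(p4data):
    """Map normalized clientFile path -> fstat record (one pass over p4data)."""
    index = {}
    for record in p4data:
        client_file = record.get('clientFile')
        if client_file:  # skip records that carry no local path
            index[ntpath.normcase(client_file)] = record
    return index


def match_local_files(path_to_data, p4_index):
    """Yield (local_path, record_or_None) for every file under path_to_data."""
    for root, dirs, files in os.walk(path_to_data, topdown=False):
        for name in files:
            local_path = os.path.join(root, name)
            # Dictionary lookup replaces the per-file scan of the whole list.
            yield local_path, p4_index.get(ntpath.normcase(local_path))
```

With this shape, the total work is roughly one pass over the Perforce records plus one pass over the local files, rather than one pass over the records per local file.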
I am confident this is not the most efficient way to handle this.
Most of the run time seems to come from:
- Getting all of the local data
- Looping over the Perforce data once per local file to find matches.
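To check which of these two costs actually dominates, each phase can be timed separately. A rough sketch using `time.perf_counter`; `timed` and `list_local_files` are hypothetical helpers, not part of any library, and the walk over the current directory is only a stand-in for the real workspace path.

```python
import os
import time


def timed(label, func, *args, **kwargs):
    """Run func, print the elapsed wall-clock time, and return its result."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    print("%s: %.3f s" % (label, time.perf_counter() - start))
    return result


def list_local_files(path_to_data):
    """Collect every file path under path_to_data."""
    return [os.path.join(root, name)
            for root, dirs, files in os.walk(path_to_data)
            for name in files]


# Example: time the directory walk on the current directory.
local_files = timed("os.walk", list_local_files, os.curdir)
```

The same wrapper could be placed around the Perforce call and the matching step to compare all three timings.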
I need this to run as fast as possible. Ideally it would finish in just a couple of seconds, but I understand that the run time will vary because the size of the data set is not known in advance.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
