Find pairs in 2 files that are too large to fit in memory
I'm looking for a memory-efficient way to find matching pairs between 2 possibly very large files (up to 1 TB each) that do not fit into memory. For simplicity, let's say each file contains just integers, but they are unordered and not every integer in one file necessarily exists in the other.
I thought about using an external sort to sort each file, then reading the first entry from each file and applying the following pseudologic:
(read_file() would return something that you can compare, and call advance() on to shift the cursor to the next line)

    left_file = read_file("file-1.txt")
    right_file = read_file("file-2.txt")

    while (there_are_entries_in_both_files):
        if left_file == right_file:
            # write match to output, then advance both
            left_file.advance()
            right_file.advance()
        elif left_file < right_file:
            # left is smaller, so we can skip that entry
            left_file.advance()
        elif right_file < left_file:
            # right is smaller, so we can skip that entry
            right_file.advance()
This would in theory work, but it would limit me to a single thread, which would not be very fast.
The process with example data would look like this (left column is file 1, right column is file 2):

    1 1
    2 5
    3 6
    5 7

- We read `1` from both left and right; it matches, so we advance both files
- We read `2` from left and `5` from right; left is smaller, so we discard the `2` and advance left
- We read `3` from left and still have the `5` from right; left is still smaller, so we discard the `3` and advance left
- We read `5` from left and still have the `5` from right, which now matches, so we advance both
- repeat...
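
For what it's worth, here is a minimal runnable sketch of that single-threaded merge step in Python. It assumes both files have already been externally sorted and contain one integer per line; the file names and helper functions are purely illustrative, not part of the question.

```python
# Minimal sketch of the sorted-merge join described above.
# Assumes file-1.txt and file-2.txt are already externally sorted,
# one integer per line (names are illustrative).

def read_ints(path):
    """Yield integers from a sorted text file, one per line."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                yield int(line)

def merge_join(left_path, right_path, out_path):
    """Write every value present in both sorted files to out_path."""
    left = read_ints(left_path)
    right = read_ints(right_path)
    with open(out_path, "w") as out:
        try:
            l, r = next(left), next(right)
            while True:
                if l == r:
                    out.write(f"{l}\n")      # match: emit and advance both
                    l, r = next(left), next(right)
                elif l < r:
                    l = next(left)           # left is smaller: skip it
                else:
                    r = next(right)          # right is smaller: skip it
        except StopIteration:
            pass  # one file ran out; no further matches are possible

if __name__ == "__main__":
    merge_join("file-1.txt", "file-2.txt", "matches.txt")
```

Memory use stays constant because only the current value from each file is held at any time.
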
Are there any algorithms that would allow me to match pairs in a multithreaded way?
Solution 1:[1]
The correct way to solve this is with MapReduce, which helps you divide these huge files into smaller chunks and, once you have the partial results, merge them back into a single output.
If you are not looking to set up a cluster for this purpose, you can follow the same algorithm on a single node as well.
Rough pseudocode:
- Split both files into `x` parts (say 1000 files of 1 GB each, or maybe even smaller).
- Write logic that takes 1 part file from each of the 2 sets and tries to match them with your logic. If there are matches, store them in part_a1_b1_output. Also, make sure that if splitting leads to some metadata loss, you find an alternate way to compute it (one option is sketched after this list).
- Once you have all the output files ready, do an n-way merge to get the final result.
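
The "1 part file from each set" step only works directly if equal values are guaranteed to land in the same-numbered part of both files; otherwise every part of file 1 would have to be compared against every part of file 2. One way to get that guarantee (my assumption, not spelled out in the answer) is to partition both files by the value itself, e.g. value modulo the number of parts, and then match the part pairs in parallel. A rough sketch under those assumptions, with illustrative names and part counts:

```python
# Sketch: partition both files so equal values land in the same-numbered
# part, then match each part pair in a separate worker process.
# NUM_PARTS, the file names and the in-memory set are illustrative choices;
# if a part pair is still too large, sort the parts and reuse the merge join
# shown earlier instead of the set lookup.

from multiprocessing import Pool

NUM_PARTS = 1000  # opening this many files at once may need a raised fd limit

def partition(path, prefix):
    """Split 'path' so that value v goes into part number v % NUM_PARTS."""
    outs = [open(f"{prefix}_{i}.txt", "w") for i in range(NUM_PARTS)]
    try:
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line:
                    v = int(line)
                    outs[v % NUM_PARTS].write(f"{v}\n")
    finally:
        for o in outs:
            o.close()

def match_part(i):
    """Match part i of both files; each part pair is assumed to fit in memory."""
    with open(f"a_{i}.txt") as f:
        left = {int(line) for line in f}
    with open(f"b_{i}.txt") as f, open(f"out_{i}.txt", "w") as out:
        for line in f:
            v = int(line)
            if v in left:
                out.write(f"{v}\n")
    return i

if __name__ == "__main__":
    partition("file-1.txt", "a")   # writes a_0.txt ... a_999.txt
    partition("file-2.txt", "b")   # writes b_0.txt ... b_999.txt
    with Pool() as pool:
        pool.map(match_part, range(NUM_PARTS))
    # concatenate (or n-way merge) the out_*.txt files for the final result
```

Because equal integers always share the same part number, each worker is completely independent, which is what allows the matching to run in parallel.
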
But if you use an actual cluster, this can be done rather quickly.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Phenomenal One |
