How to search a binary file for a signature offset
This is my example so far:
```python
import io
import mmap
import os
import time

# 20 MB test file
filename = "random.bin"
if not os.path.isfile(filename):
    with open(filename, "wb") as f:
        for _ in range(20):
            f.write(os.urandom(1_000_000))

signature = b"\x01\x02\x03\x04"

print("Method 1:")
start_time = time.time()
offsets = []
with open(filename, "rb") as f:
    buf = b"\x00" + f.read(len(signature) - 1)
    for offset, byte in enumerate(iter(lambda: f.read(1), b"")):
        buf = buf[1:] + byte
        if buf == signature:
            offsets.append(offset)
print(f"{time.time() - start_time:.2f} seconds")
print(offsets)

print("Method 2:")
start_time = time.time()
offsets = []
with open(filename, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    for offset in range(len(mm)):
        if mm[offset : offset + len(signature)] == signature:
            offsets.append(offset)
print(f"{time.time() - start_time:.2f} seconds")
print(offsets)

print("Method 3:")
start_time = time.time()
offsets = []
with open(filename, "rb") as f:
    offset = 0
    chunk = f.read(len(signature) - 1)
    while True:
        chunk = chunk[-(len(signature) - 1) :] + f.read1(io.DEFAULT_BUFFER_SIZE)
        if len(chunk) < len(signature):
            # EOF
            break
        for i in range(len(chunk) - (len(signature) - 1)):
            if chunk[i : i + len(signature)] == signature:
                offsets.append(offset + i)
        offset += len(chunk) - (len(signature) - 1)
print(f"{time.time() - start_time:.2f} seconds")
print(offsets)
```
I'm searching through a 20 MB test file for a 4-byte signature. Switching to Method 2 with mmap.mmap cut the runtime by 50%, but it's still really slow, especially since my actual target file will be between 1 and 10 GB (which is why I don't just load the whole file into memory first). It is orders of magnitude slower than md5sum random.bin.
Edit: I've added a third method, which doesn't use f.read(1) but instead reads chunks of io.DEFAULT_BUFFER_SIZE, using read1 to avoid blocking. It's still not faster, though.
Solution 1:[1]
Reading the file wasn't the issue; using chunks instead of read(1) should definitely be fast enough. However, iterating over the bytes with slicing afterwards is apparently a bad idea. bytes.find is much faster. I don't yet understand why, but I've posted a new question, How is str.find so fast?, about that.
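To see the gap in isolation, here is a quick micro-benchmark sketch (the 2 MB buffer size and variable names are mine, not from the answer) comparing a pure-Python slicing scan against bytes.find on the same in-memory buffer:

```python
import os
import time

data = os.urandom(2_000_000)  # hypothetical 2 MB in-memory buffer
needle = b"\x01\x02\x03\x04"

# Pure-Python scan: compare a 4-byte slice at every offset.
start = time.time()
hits_slicing = [
    i for i in range(len(data) - len(needle) + 1)
    if data[i : i + len(needle)] == needle
]
t_slicing = time.time() - start

# bytes.find: the same scan, but running in optimized C code.
start = time.time()
hits_find = []
pos = data.find(needle)
while pos >= 0:
    hits_find.append(pos)
    pos = data.find(needle, pos + 1)
t_find = time.time() - start

print(f"slicing: {t_slicing:.3f} s, find: {t_find:.3f} s")
assert hits_slicing == hits_find  # both scans agree on the offsets
```

On typical hardware the find variant is a couple of orders of magnitude faster, which matches the Method 3 vs. Method 4 numbers below.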
Here is a simple solution - a 4th method for the example code from above - which uses bytes.find:
```python
print("Method 4:")
start_time = time.time()
offsets = []

def _find(haystack, needle, start=0, offset=0):
    while True:
        position = haystack.find(needle, start)
        if position < 0:
            return
        start = position + 1
        yield position + offset

with open(filename, "rb") as f:
    offset = 0
    chunk = f.read(len(signature) - 1)
    while True:
        chunk = chunk[-(len(signature) - 1) :] + f.read1(io.DEFAULT_BUFFER_SIZE)
        if len(chunk) < len(signature):
            # EOF
            break
        offsets.extend(_find(chunk, signature, offset=offset))
        offset += len(chunk) - (len(signature) - 1)
print(f"{time.time() - start_time:.2f} seconds")
print(offsets)
```
It produces the following time measurements:
```
Method 1:
8.83 seconds
[20971596, 20971686]
Method 2:
5.69 seconds
[20971596, 20971686]
Method 3:
5.76 seconds
[20971596, 20971686]
Method 4:
0.02 seconds
[20971596, 20971686]
```
So yeah, Method 4 is clearly the winner.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
