Does slicing a bytes object create a whole new copy of the data in Python?

Say I have a very large bytes object (after loading a binary file) and I want to read it part by part, advancing the starting position until it reaches the end. I use slicing to accomplish this. I'm worried that Python will create a completely new copy each time I ask for a slice, instead of simply giving me a reference to the memory at the position I want.

Simple example:

from pathlib import Path

data = Path("binary-file.dat").read_bytes()
total_length = len(data)
start_pos = 0

while start_pos < total_length:
    bytes_processed = decode_bytes(data[start_pos:])  # <---- ***
    start_pos += bytes_processed

In the above example, does Python create a completely new copy of the bytes object starting from start_pos because of the slicing? If so, what is the best way to avoid the copy and instead pass just a reference to the relevant position of the bytes array?



Solution 1:[1]

Yes, slicing a bytes object does create a copy, at least as of CPython 3.9.12. The closest the documentation comes to admitting this is in the description of the bytes constructor:

In addition to the literal forms, bytes objects can be created in a number of other ways:

  • A zero-filled bytes object of a specified length: bytes(10)
  • From an iterable of integers: bytes(range(20))
  • Copying existing binary data via the buffer protocol: bytes(obj)

which suggests any creation of a bytes object creates a separate copy of the data. But since I had a hard time finding an explicit confirmation that slicing does the same, I resorted to an empirical test.

>>> b = b'\1' * 100_000_000
>>> qq = [b[1:] for _ in range(20)]

After executing the first line, memory usage of the python3 process in top was about 100 MB. The second line executed after a considerable delay, making memory usage rise to about 2 GB. This seems pretty conclusive. PyPy 7.3.9 targeting Python 3.8 behaves largely the same; though of course, PyPy’s garbage collection is not as eager as CPython’s, so the memory is not freed as soon as the bytes objects become unreachable.
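The copy can also be observed without watching top: sys.getsizeof reports the size of the slice's own buffer, which is on the order of the original object rather than a constant-size header. A minimal check (the exact counts include a small per-object header, so the numbers are approximate):

```python
import sys

b = b'\x01' * 1_000_000
s = b[1:]

# The slice is a distinct object owning its own ~1 MB buffer.
print(s is b)            # False
print(sys.getsizeof(s))  # roughly 1_000_000 bytes, not a few dozen
```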

To avoid copying the underlying buffer, wrap your bytes in a memoryview and slice that:

>>> bm = memoryview(b)
>>> qq = [bm[1:] for _ in range(50)]
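A memoryview slice keeps referencing the original buffer instead of copying it, which can be verified through the view's .obj attribute. The decoding loop from the question can then slice the view instead of the bytes object; decode_bytes below is a hypothetical stand-in for the asker's decoder, sketched here only so the loop runs:

```python
b = b'\x01' * 1_000_000
mv = memoryview(b)
s = mv[1:]

assert s.obj is b              # the slice still points at the original bytes
assert s.nbytes == len(b) - 1  # same data, no new 1 MB buffer allocated

# Hypothetical decoder: pretends to consume up to 64 KiB per call.
def decode_bytes(buf) -> int:
    return min(65536, len(buf))

data = memoryview(b)
start_pos = 0
while start_pos < len(data):
    bytes_processed = decode_bytes(data[start_pos:])  # O(1) view, no copy
    start_pos += bytes_processed
```

Note that the real decoder must accept a memoryview; if it insists on bytes (e.g. calls a C API that requires it), converting the view back with bytes(view) reintroduces the copy for that chunk.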

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
