Reading a pickle file in a cloud Jupyter instance from a GCP stream (SList)

I am working with some large data in Google Cloud Platform storage, using a JupyterLab notebook in GCP Vertex AI Workbench to avoid local storage and data transfer.

Some of my problems are solved by piping gsutil output through Linux command-line tools. For example:

s_path_final = 'gs://bucket_name/filename.txt'
s_pattern = 'search_target_text'
!gsutil cp {s_path_final} - | egrep -m 1 '{s_pattern}'

finds the first occurrence of the search text in the text file as desired.

What isn't working is reading a Python pickle file streaming from the GCP bucket. For example,

import io
s_stream_out = !gsutil cp {GS_path_to_pickle} -
df = pd.read_pickle(io.StringIO(s_stream_out.n))

fails with the error message "a bytes-like object is required, not 'str'".

s_stream_out seems to be an object of type SList (cf. https://gist.github.com/parente/b6ee0efe141822dfa18b6feeda0a45e5) that I don't know what to do with. Is there a way to reassemble it appropriately? Simple-minded solutions like running a string join on it didn't help.

I don't really understand pickle, I'm afraid, but I gather it's a sort of serialized format for saving Python objects, so in the best case, a solution to all this would allow some kind of looping through its serial structure and pulling the items one-by-one directly back into Python memory, without trying to save or re-create the whole pickle file locally or in memory.



Solution 1:[1]

I suspect that you're going to need to use a Google Client Library directly.

Here's a Python code sample to stream a download to a file|stream that should meet your needs.

I'm unfamiliar with Jupyter|IPython, but I suspect that its string lists (SList) are only suitable for non-binary data. The error message you're receiving supports this.
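To illustrate the binary-versus-text point, here is a minimal sketch using only the standard library. The sample object is hypothetical and stands in for the real data. It shows that pickled data is bytes, that it does not decode as UTF-8 text, and that it loads back when wrapped in io.BytesIO rather than io.StringIO.

```python
import io
import pickle

# Hypothetical sample object standing in for the real pickled data.
sample = {"rows": [1, 2, 3], "label": "demo"}

# pickle.dumps returns bytes, not str.
payload = pickle.dumps(sample, protocol=4)

# Binary pickle protocols (2 and later) begin with the byte 0x80,
# which is not a valid UTF-8 start byte, so the payload is not text.
try:
    payload.decode("utf-8")
    decodes_as_text = True
except UnicodeDecodeError:
    decodes_as_text = False

# Wrapping the raw bytes in BytesIO gives pickle the binary
# file object it expects.
restored = pickle.load(io.BytesIO(payload))
```

Anything that captures the stream as lines of text, as a string list appears to, would need to preserve every byte exactly for this round trip to work.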

I think you could pickle.load the file_obj that's created in the sample.
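A rough sketch of that idea follows, assuming the google-cloud-storage package is installed and credentials are available in the notebook environment. The helper names load_pickle_from_blob and get_blob are hypothetical, not part of any library. The blob is downloaded into an in-memory bytes buffer, which is then passed to pickle.load.

```python
import io
import pickle


def load_pickle_from_blob(blob):
    """Download a blob into memory and unpickle it.

    `blob` is assumed to be a google.cloud.storage.Blob, or any object
    with a compatible download_to_file(file_obj) method.
    """
    file_obj = io.BytesIO()
    blob.download_to_file(file_obj)  # writes the object's raw bytes
    file_obj.seek(0)                 # rewind before reading
    return pickle.load(file_obj)


def get_blob(bucket_name, blob_name):
    """Build a Blob handle; requires the google-cloud-storage package."""
    # Imported lazily so load_pickle_from_blob works without the package.
    from google.cloud import storage

    client = storage.Client()
    return client.bucket(bucket_name).blob(blob_name)
```

Usage might look like df = load_pickle_from_blob(get_blob("bucket_name", "path/to/file.pkl")), where the object path is a placeholder. Note that this holds the whole file in memory; it avoids local disk but does not read the pickle item by item.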

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 DazWilkin