How to Read Append Blobs as DataFrames in Azure Databricks

My batch processing pipeline in Azure has the following scenario: I am using the copy activity in Azure Data Factory to unzip thousands of zip files stored in a blob storage container. The zip files sit in a nested folder structure inside the container, e.g.

zipContainer/deviceA/component1/20220301.zip

The resulting unzipped files are stored in another container, with the hierarchy preserved via the sink's copy behavior option, e.g.

unzipContainer/deviceA/component1/20220301.zip/measurements_01.csv

I enabled logging on the copy activity:

(screenshot: the copy activity's logging settings)

I then provided the folder path to store the generated logs (in txt format), which have the following structure:

Timestamp Level OperationName OperationItem Message
2022-03-01 15:14:06.9880973 Info FileWrite "deviceA/component1/2022.zip/measurements_01.csv" "Complete writing file. File is successfully copied."

I want to read the content of these logs in an R notebook in Azure Databricks, in order to get the complete paths of these csv files for processing. The command I used, read.df, is part of the SparkR library:

library(SparkR)

Logs <- read.df(log_path, source = "csv", header = "true", delimiter = ",")

The following exception is returned:

Exception: Incorrect Blob type, please use the correct Blob type to access a blob on the server. Expected BLOCK_BLOB, actual APPEND_BLOB.

The logs generated by the copy activity are of the append blob type; read.df() can read block blobs without any issue.
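
As an aside (not part of the original question): read.df goes through a Hadoop storage driver that expects block blobs, as the error message says, while a client that calls the Blob Storage REST API directly can fetch append blobs just as well. A minimal sketch using the AzureStor package, with a hypothetical account name, key, container, and log path:

library(AzureStor)

# All names below are placeholders; substitute your own account, key, and paths
endp <- storage_endpoint("https://<account>.blob.core.windows.net", key = "<account-key>")
cont <- blob_container(endp, "<log-container>")

# download_blob() goes through the Blob REST API, so the blob type does not matter
download_blob(cont, src = "<log-folder>/<log-file>.txt", dest = "/tmp/copy_log.txt", overwrite = TRUE)

# The logs are comma-delimited with a header row, per the read.df call above
logs <- read.csv("/tmp/copy_log.txt", stringsAsFactors = FALSE)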

Given the above scenario, how can I read these logs successfully into my R session in Databricks?
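
If the log can be loaded by a route like the one sketched above, pulling out the complete csv paths could look like the following, again only a sketch: the column names come from the log header shown earlier, and the sink container URI is a placeholder.

# Keep only the rows that record successful file writes
writes <- subset(logs, OperationName == "FileWrite" & grepl("successfully copied", Message))

# Prefix the relative paths with the sink container's URI (placeholder account)
csv_paths <- paste0("wasbs://unzipContainer@<account>.blob.core.windows.net/", writes$OperationItem)
head(csv_paths)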


