How to convert a TableClient query result to a PySpark DataFrame in Azure Tables?

I'm trying to convert a TableClient query result in Azure to a PySpark DataFrame, but it doesn't work.

I've tried:

from azure.data.tables import TableClient
table_name = "Tablename"
my_filter = "DateTimeUTC ge datetime'2022-02-28T00:00:00Z' and DateTimeUTC le datetime'2022-02-28T01:59:00Z'"

table_client = TableClient.from_connection_string(
    conn_str="DefaultEndpointsProtocol=https;AccountName=Accountname;AccountKey=key",
    table_name=table_name,
)

entities = table_client.query_entities(my_filter)


df = spark.read.option("multiline","true").json(entities)

But it didn't work. I can't even get the length of entities; it fails with this error:

*AttributeError: 'ItemPaged' object has no attribute 'keys'*
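From what I can tell, entities is a lazy ItemPaged iterator rather than a list, so I assume len() only works after materializing it, e.g.:

rows = list(entities)  # drain the paged iterator into driver memory
print(len(rows))       # a plain list supports len()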

When I iterate over entities and print the rows, my data looks like:

{'PartitionKey': '10000', 'RowKey': '20220228091315', 'Acceleration': 0.0, 'Altitude': 971, 'BatteryVoltage': 13.35, 'DateTimeUTC': TablesEntityDatetime(2022, 2, 28, 9, 13, 15, tzinfo=datetime.timezone.utc)}
{'PartitionKey': '10000', 'RowKey': '20220228091820', 'Acceleration': 0.0, 'Altitude': 980, 'BatteryVoltage': 13.35, 'DateTimeUTC': TablesEntityDatetime(2022, 2, 28, 9, 18, 20, tzinfo=datetime.timezone.utc)}
...

I want a PySpark DataFrame so I can apply the usual libraries, functions, and methods.



Solution 1:[1]

If your table is not too big, could you try:

import json

# json() can't read ItemPaged directly, but it does accept an RDD of JSON strings;
# default=str turns the TablesEntityDatetime values into plain text.
df = spark.read.json(
    spark.sparkContext.parallelize([json.dumps(dict(row), default=str) for row in entities])
)

It seems entities is an ItemPaged iterator, and Spark tries to read keys off that object as if it were a dict in order to infer a schema. Converting each row to a plain dict and dumping it to a JSON string gives json() input it actually knows how to read.
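If you'd rather skip the JSON round-trip, you could also hand the dicts straight to createDataFrame and let Spark infer the schema. A minimal sketch, assuming the whole result fits in driver memory:

# Schema is inferred from the dict keys/values; since (as far as I know)
# TablesEntityDatetime subclasses datetime, DateTimeUTC should land as a timestamp.
df = spark.createDataFrame([dict(row) for row in entities])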

I'm also uncertain whether Spark recognizes TablesEntityDatetime objects. But let's do one thing at a time.
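If DateTimeUTC comes through as a string (which is what default=str produces), a hedged sketch for casting it back afterwards:

from pyspark.sql import functions as F

# str() on a datetime yields e.g. "2022-02-28 09:13:15+00:00", which Spark's
# timestamp parser accepts with the default format.
df = df.withColumn("DateTimeUTC", F.to_timestamp("DateTimeUTC"))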

Sources

[1] Stack Overflow

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.