How to only load one portion of an AzureML tabular dataset (linked to Azure Blob Storage)

I have a Dataset defined in my AzureML workspace that is linked to a 1.6 GB CSV file in Azure Blob Storage. This file contains time-series information for around 10,000 devices. So I could also have created 10,000 smaller files (since I use ADF for the transmission pipeline).

My question now is: is it possible to load a part of the AzureML DataSet in my python notebook or script instead of loading the entire file?
The only code I have now loads the full file:

from azureml.core import Dataset

dataset = Dataset.get_by_name(workspace, name='devicetelemetry')
df = dataset.to_pandas_dataframe()

The only concept of partitions I found with regard to AzureML datasets was around time series and partitioning by timestamps and dates. However, here I would love to partition per device, so I can very easily load all telemetry of a specific device.

Any pointers to docs or any suggestions? (I couldn't find any so far)

Thanks already



Solution 1:[1]

Tabular datasets support only time-series filters today. We are working on enabling generic filtering on tabular datasets.

In the meantime, the other option is to partition the files by device ID and use a FileDataset. A FileDataset lets you mount the entire dataset on compute and then read only the files belonging to one or a few devices. The downside is that you need to read those files with pandas directly.
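A minimal sketch of that approach, assuming the telemetry fits in pandas for the one-time split (the column name device_id and the dataset name devicetelemetry-files are hypothetical); the FileDataset registration and mount calls need a live workspace, so they are shown as comments:

```python
import os
import pandas as pd

def split_by_device(csv_path: str, out_dir: str, device_col: str = "device_id") -> list:
    """Split one large telemetry CSV into one CSV per device.

    The resulting folder can then be registered as a FileDataset
    (Dataset.File.from_files) and mounted on compute, so a job only
    reads the files for the devices it needs.
    """
    os.makedirs(out_dir, exist_ok=True)
    df = pd.read_csv(csv_path)
    written = []
    for device_id, group in df.groupby(device_col):
        path = os.path.join(out_dir, f"{device_id}.csv")
        group.to_csv(path, index=False)
        written.append(path)
    return written

# Later, on the compute target (sketch only; needs a real workspace):
# from azureml.core import Dataset
# file_ds = Dataset.get_by_name(workspace, name="devicetelemetry-files")
# with file_ds.mount() as mount_ctx:
#     one_device = pd.read_csv(os.path.join(mount_ctx.mount_point, "device-42.csv"))
```

In practice you would do the split once in ADF or a one-off job, then register the output folder; after that, each run touches only the per-device files it opens.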

Solution 2:[2]

TabularDataset has since introduced filtering (experimental as of February 2021):

ds.filter(ds['patient_id'] == my_patient_id).to_pandas_dataframe()

You will also find methods like take, which gives you the first n rows, take_sample for a random sample of the desired size, etc.
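A sketch of how these methods fit together, matched to the question's device scenario (the column name device_id is hypothetical). The AzureML calls themselves need a live workspace and are shown as comments; the filter has the same row-selection semantics as a pandas boolean mask, demonstrated below on a small frame:

```python
import pandas as pd

# With a registered TabularDataset (sketch only; needs a real workspace):
# from azureml.core import Dataset
# ds = Dataset.get_by_name(workspace, name="devicetelemetry")
# one_device = ds.filter(ds["device_id"] == "device-42").to_pandas_dataframe()
# preview = ds.take(1000).to_pandas_dataframe()                     # first 1000 rows
# sample = ds.take_sample(probability=0.01).to_pandas_dataframe()   # ~1% random sample

# The filter selects rows the same way a pandas boolean mask does:
telemetry = pd.DataFrame({
    "device_id": ["device-1", "device-2", "device-1"],
    "reading": [0.5, 0.7, 0.9],
})
one_device = telemetry[telemetry["device_id"] == "device-1"]
```

The key point is that the dataset-level calls are lazy: only the rows matching the filter (or the requested take/sample) are materialized into pandas, so the full 1.6 GB file never has to be loaded.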

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Dharman
Solution 2: Maciej S.