Read Vertex AI datasets in a Jupyter notebook

I am trying to create a Python utility that takes a dataset from Vertex AI Datasets and generates statistics for it. However, I am unable to access the dataset from a Jupyter notebook. Is there any way to do this?



Solution 1:[1]

If I understand correctly, you want to use a Vertex AI dataset inside a Jupyter notebook. I don't think that this is currently possible. You are, however, able to export Vertex AI datasets to Google Cloud Storage in JSONL format:

Your dataset will be exported as a list of text items in JSONL format. Each row contains a Cloud Storage path, any label(s) assigned to that item, and a flag that indicates whether that item is in the training, validation, or test set.
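Once exported, each JSONL line can be parsed in the notebook with the standard json module. A minimal sketch follows; note that the field names used here ("textGcsUri", "classificationAnnotation", "dataItemResourceLabels") are illustrative assumptions, so check your actual export file for the real schema:

```python
import json

# One line of a hypothetical Vertex AI JSONL export; the field names
# are assumptions for illustration -- inspect your own export to confirm.
line = (
    '{"textGcsUri": "gs://my-bucket/item-1.txt", '
    '"classificationAnnotation": {"displayName": "positive"}, '
    '"dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": "training"}}'
)

record = json.loads(line)
print(record["textGcsUri"])                               # Cloud Storage path of the item
print(record["classificationAnnotation"]["displayName"])  # label assigned to the item
```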

At the moment, you can query BigQuery data inside a notebook using the %%bigquery magic, as described in Visualizing BigQuery data in a Jupyter notebook, or use pandas.read_csv() to read a file from the machine's local directory or from GCS, as shown in the How to read csv file in Google Cloud Platform jupyter notebook thread.
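Since the goal is to generate statistics, once the exported file is readable (locally, or via a gs:// path with the gcsfs package installed) pandas can do the summarizing. A minimal sketch with inline sample data standing in for the export (the column names here are made up for illustration):

```python
import io
import pandas as pd

# Inline sample standing in for an exported CSV; with gcsfs installed,
# pd.read_csv("gs://my-bucket/export.csv") works the same way.
csv_data = io.StringIO(
    "label,length\n"
    "positive,120\n"
    "negative,80\n"
    "positive,95\n"
)
df = pd.read_csv(csv_data)

# Summary statistics for numeric columns, plus label counts.
print(df.describe())
print(df["label"].value_counts())
```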

However, you can file a Feature Request in the Google Issue Tracker to add the ability to use Vertex AI datasets directly in Jupyter notebooks; it will be considered by the Google Vertex AI team.

Solution 2:[2]

Please correct me if I am wrong: are you trying to access a Vertex AI dataset that is in your GCP project from the Jupyter notebook? If so, try the code below and see if you can access the dataset.

def list_datasets(project_id, compute_region, filter=None):
    """List all datasets."""
    result = []
    # [START automl_tables_list_datasets]
    # TODO(developer): Uncomment and set the following variables
    # project_id = 'PROJECT_ID_HERE'
    # compute_region = 'COMPUTE_REGION_HERE'
    # filter = 'filter expression here'

    from google.cloud import automl_v1beta1 as automl

    client = automl.TablesClient(project=project_id, region=compute_region)
    print('client:', client)
    # List all the datasets available in the region by applying the filter.
    response = client.list_datasets(filter=filter)

    print("List of datasets:")
    for dataset in response:
        # Display the dataset information.
        print("Dataset name: {}".format(dataset.name))
        print("Dataset id: {}".format(dataset.name.split("/")[-1]))
        print("Dataset display name: {}".format(dataset.display_name))
        metadata = dataset.tables_dataset_metadata
        print(
            "Dataset primary table spec id: {}".format(
                metadata.primary_table_spec_id
            )
        )
        print(
            "Dataset target column spec id: {}".format(
                metadata.target_column_spec_id
            )
        )
        print(
            "Dataset weight column spec id: {}".format(
                metadata.weight_column_spec_id
            )
        )
        print(
            "Dataset ml use column spec id: {}".format(
                metadata.ml_use_column_spec_id
            )
        )
        print("Dataset example count: {}".format(dataset.example_count))
        print("Dataset create time: {}".format(dataset.create_time))
        print("\n")

        result.append(dataset)
    # [END automl_tables_list_datasets]

    return result

You need to pass project_id and compute_region when calling this function.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: PjoterS
Solution 2: tt0206