"The timestamp column must have valid timestamp entries." error when using the `timestamp_split_column_name` arg in `AutoMLTabularTrainingJob.run`

The docs say:

The value of the key values of the key (the values in the column) must be in RFC 3339 date-time format, where time-offset = “Z” (e.g. 1985-04-12T23:20:50.52Z)

The dataset that I'm pointing to is a CSV in cloud storage, where the data is in the format suggested by the docs:

$ gsutil cat gs://my-data.csv | head | xsv select TS_SPLIT_COL
TS_SPLIT_COL
2021-01-18T00:00:00.00Z
2021-01-18T00:00:00.00Z
2021-01-04T00:00:00.00Z
2021-03-06T00:00:00.00Z
2021-01-15T00:00:00.00Z
2021-02-11T00:00:00.00Z
2021-02-05T00:00:00.00Z
2021-05-20T00:00:00.00Z
2021-01-05T00:00:00.00Z

But when I try to run a training job, it fails with: "Training pipeline failed with error message: The timestamp column must have valid timestamp entries."

EDIT: the data and code below should hopefully make this more reproducible.

data: https://pastebin.com/qEDqvzX6

Code I'm running:

from google.cloud import aiplatform

PROJECT = "my-project"
DATASET_ID = "dataset-id"  # points to CSV 

aiplatform.init(project=PROJECT)

dataset = aiplatform.TabularDataset(DATASET_ID)

job = aiplatform.AutoMLTabularTrainingJob(
    display_name="so-58454722",
    optimization_prediction_type="classification",
    optimization_objective="maximize-au-roc",
)

model = job.run(
    dataset=dataset,
    model_display_name="so-58454722",
    target_column="Y",
    training_fraction_split=0.8,
    validation_fraction_split=0.1,
    test_fraction_split=0.1,
    timestamp_split_column_name="TS_SPLIT_COL",
)


Solution 1:[1]

Try this timestamp format instead:

2022-03-18T01:23:45.123456+00:00

It specifies the UTC offset as +00:00 instead of Z.

With this change, the "The timestamp column must have valid timestamp entries." error should go away.
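If the CSV already contains Z-suffixed values like the ones in the question, one way to convert them is a small standard-library script. This is only a sketch: the helper names `z_to_utc_offset` and `rewrite_column` are made up for illustration, it assumes every value looks like `2021-01-18T00:00:00.00Z` (fractional seconds present), and it pads to six fractional digits only to match the example above. Whether that precision matters has not been verified.

```python
import csv
import io
from datetime import datetime, timezone


def z_to_utc_offset(ts: str) -> str:
    """Rewrite '2021-01-18T00:00:00.00Z' as '2021-01-18T00:00:00.000000+00:00'."""
    # Assumes a trailing literal 'Z' and a fractional-seconds part.
    parsed = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc).isoformat(timespec="microseconds")


def rewrite_column(src, dst, column="TS_SPLIT_COL"):
    """Copy CSV rows from src to dst, converting one timestamp column."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        row[column] = z_to_utc_offset(row[column])
        writer.writerow(row)


# Small in-memory demo; for real files, pass objects opened with newline="".
demo_in = io.StringIO("Y,TS_SPLIT_COL\n1,2021-01-18T00:00:00.00Z\n")
demo_out = io.StringIO()
rewrite_column(demo_in, demo_out)
print(demo_out.getvalue())
```

After converting, the rewritten CSV would need to be uploaded again and a new dataset created from it before re-running the training job.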

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 WuA