"The timestamp column must have valid timestamp entries." error when using `timestamp_split_column_name` arg in `AutoMLTabularTrainingJob.run`
The docs say that:
The values of the key (the values in the column) must be in RFC 3339 date-time format, where time-offset = "Z" (e.g. 1985-04-12T23:20:50.52Z)
The dataset that I'm pointing to is a CSV in cloud storage, where the data is in the format suggested by the docs:
$ gsutil cat gs://my-data.csv | head | xsv select TS_SPLIT_COL
TS_SPLIT_COL
2021-01-18T00:00:00.00Z
2021-01-18T00:00:00.00Z
2021-01-04T00:00:00.00Z
2021-03-06T00:00:00.00Z
2021-01-15T00:00:00.00Z
2021-02-11T00:00:00.00Z
2021-02-05T00:00:00.00Z
2021-05-20T00:00:00.00Z
2021-01-05T00:00:00.00Z
But when I try to run a training job, it fails with: "Training pipeline failed with error message: The timestamp column must have valid timestamp entries."
EDIT: this should hopefully make it more reproducible
data: https://pastebin.com/qEDqvzX6
Code I'm running:
from google.cloud import aiplatform

PROJECT = "my-project"
DATASET_ID = "dataset-id"  # points to CSV

aiplatform.init(project=PROJECT)
dataset = aiplatform.TabularDataset(DATASET_ID)

job = aiplatform.AutoMLTabularTrainingJob(
    display_name="so-58454722",
    optimization_prediction_type="classification",
    optimization_objective="maximize-au-roc",
)

model = job.run(
    dataset=dataset,
    model_display_name="so-58454722",
    target_column="Y",
    training_fraction_split=0.8,
    validation_fraction_split=0.1,
    test_fraction_split=0.1,
    timestamp_split_column_name="TS_SPLIT_COL",
)
Solution 1:[1]
Try this timestamp format instead:
2022-03-18T01:23:45.123456+00:00
It uses +00:00 instead of Z to specify the UTC offset.
This change will eliminate the "The timestamp column must have valid timestamp entries." error.
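If the CSV already contains Z-suffixed values like the ones in the question, one way to rewrite the column is sketched below. It uses only the Python standard library. The helper names `to_offset_format` and `convert_csv` are made up for this illustration, and the column name `TS_SPLIT_COL` is taken from the question. It assumes every value is a UTC timestamp ending in Z, with or without fractional seconds.

```python
import csv
from datetime import datetime, timezone


def to_offset_format(value: str) -> str:
    """Rewrite a 'Z'-suffixed UTC timestamp using an explicit '+00:00' offset."""
    # Try the fractional-seconds form first, then the whole-seconds form.
    for fmt in ("%Y-%m-%dT%H:%M:%S.%fZ", "%Y-%m-%dT%H:%M:%SZ"):
        try:
            parsed = datetime.strptime(value, fmt)
        except ValueError:
            continue
        # Attach UTC and always emit six fractional digits.
        return parsed.replace(tzinfo=timezone.utc).isoformat(timespec="microseconds")
    raise ValueError(f"not a 'Z'-suffixed timestamp: {value!r}")


def convert_csv(src, dst, column="TS_SPLIT_COL"):
    """Copy CSV rows from src to dst, rewriting the timestamp column."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        row[column] = to_offset_format(row[column])
        writer.writerow(row)
```

To use it, open the input and output files with `newline=""` and pass both handles to `convert_csv`, then upload the rewritten file and point the dataset at it. A value such as 2021-01-18T00:00:00.00Z comes out as 2021-01-18T00:00:00.000000+00:00. Any entry that does not parse raises a `ValueError`, which also helps spot blank or malformed rows further down the file than `head` shows.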
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | WuA |
