What does DataAccessor do in tfx?
I'm reading the tfx tutorials, which all use DataAccessor to load data. The code looks something like this:
return data_accessor.tf_dataset_factory(
    file_pattern,
    tfxio.TensorFlowDatasetOptions(
        batch_size=batch_size, label_key=_LABEL_KEY),
    schema=schema).repeat()
This makes sense at a high level for the tutorials, but I couldn't find relevant documentation when I dug deeper.
The tf_dataset_factory function takes a tfxio.TensorFlowDatasetOptions argument, so from the argument descriptions I deduce that the class has an effect similar to:
dataset = tfds.load(name)  # load data
dataset = dataset.batch(batch_size)  # batch data
dataset = dataset.shuffle(buffer_size)  # shuffle data
dataset = dataset.prefetch(tf.data.AUTOTUNE)  # prefetch data
But it's not clear to me in what order these are applied (if it matters) or how I can manipulate the dataset in detail. For example, I want to apply Dataset.cache(), but it's not clear to me whether applying cache() after tf_dataset_factory makes sense.
Another thing I don't understand is whether DataAccessor has built-in distributed training support, i.e. do I need to multiply batch_size by strategy.num_replicas_in_sync? The distributed training tutorials state clearly that you should, but the tfx tutorials don't mention it at all, so to me it's 50/50 whether num_replicas_in_sync is already compensated for.
Is anyone else in the same boat, or does anyone have better ideas?
Solution 1:[1]
Looking through the documentation, DataAccessor seems to be a utility wrapper around a tf.data.Dataset factory. The object returned by tf_dataset_factory is an instance of tf.data.Dataset, so any method that applies to a normal Dataset object is valid here too.
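Since the result is an ordinary tf.data.Dataset, extra transformations can be chained onto it. Here is a minimal sketch of adding cache(), assuming the dataset handed in is finite (for example, the factory result before .repeat() is called on it, and with no indefinite repetition configured in the options). The helper name cache_then_repeat is hypothetical and the demo uses a stand-in dataset rather than a real DataAccessor. Note that cache() only makes sense before repeat(): caching an endlessly repeating dataset would never finish filling the cache.

```python
def cache_then_repeat(dataset, prefetch_buffer):
    """Chain cache(), repeat() and prefetch() onto a finite tf.data.Dataset.

    Hypothetical helper for illustration; `dataset` is assumed to be finite.
    """
    # cache() stores the elements during the first full pass over the data,
    # so it is applied before repeat() turns the dataset into an endless one.
    return dataset.cache().repeat().prefetch(prefetch_buffer)


def demo():
    """Build a small stand-in dataset and apply the helper to it."""
    import tensorflow as tf  # imported lazily so the helper itself has no hard dependency

    # Stand-in for the (finite, already batched) dataset a factory might return.
    dataset = tf.data.Dataset.range(10).batch(4)
    return cache_then_repeat(dataset, tf.data.AUTOTUNE)
```

In the snippet from the question, this would mean dropping the trailing .repeat() from the factory call and passing the result through the helper instead. Whether the factory already shuffles, repeats, or prefetches depends on the TensorFlowDatasetOptions passed in, so it is worth checking those fields before adding the same steps again.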
I can't comment on the distributed training support; as you rightly pointed out, it is not documented anywhere.
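For what it's worth, the distributed training tutorials mentioned in the question treat the batch size of the dataset as the global batch size, which is then split across replicas. Nothing here confirms that DataAccessor compensates for the number of replicas, so the sketch below assumes it does not and computes the global batch size explicitly. The helper name global_batch_size is hypothetical, and the replica count is hard-coded; in real code it would come from strategy.num_replicas_in_sync.

```python
def global_batch_size(per_replica_batch_size, num_replicas_in_sync):
    """Return the batch size to request so each replica gets the per-replica size.

    Assumption: the data-loading code does NOT already scale by replica count.
    """
    return per_replica_batch_size * num_replicas_in_sync


# Example: 32 examples per replica on 4 replicas.
batch_size = global_batch_size(32, 4)
print(batch_size)  # 128
```

The resulting value would then be passed as batch_size to tfxio.TensorFlowDatasetOptions. If it turns out that the scaling is already handled somewhere, this would over-scale, so checking the shape of one batch from the dataset is a cheap sanity test.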
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Manish Dash |
