How to acquire tf.data.Dataset's shape?

I know the dataset has output_shapes, but printing it only shows something like this:

data_set: DatasetV1Adapter shapes: {item_id_hist: (?, ?), tags: (?, ?), client_platform: (?,), entrance: (?,), item_id: (?,), lable: (?,), mode: (?,), time: (?,), user_id: (?,)}, types: {item_id_hist: tf.int64, tags: tf.int64, client_platform: tf.string, entrance: tf.string, item_id: tf.int64, lable: tf.int64, mode: tf.int64, time: tf.int64, user_id: tf.int64}

How can I get the total number of records in my dataset?



Solution 1:[1]

Where the length is known you can call:

tf.data.experimental.cardinality(dataset)

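If the length cannot be determined statically, this does not return a count; it returns a negative sentinel (tf.data.experimental.UNKNOWN_CARDINALITY, or tf.data.experimental.INFINITE_CARDINALITY for an endless dataset). A TensorFlow Dataset is, in general, lazily evaluated, so in the general case we may need to iterate over every record before we can find the length of the dataset.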
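A minimal sketch of checking whether the length is known, assuming TensorFlow 2.x with eager execution. The toy Dataset.range pipelines and the variable names are illustrative only and are not from the question:

```python
import tensorflow as tf

# Toy dataset whose length is statically known.
known = tf.data.Dataset.range(10)

# filter() makes the length impossible to know without iterating.
unknown = known.filter(lambda x: x % 2 == 0)

known_card = int(tf.data.experimental.cardinality(known).numpy())
unknown_card = int(tf.data.experimental.cardinality(unknown).numpy())

print(known_card)
# A negative sentinel, not a real count, signals "unknown".
print(unknown_card == tf.data.experimental.UNKNOWN_CARDINALITY)
```

Transformations such as filter are a common reason the length becomes unknown, which is when the iteration-based approaches apply.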

For example, assuming you have eager execution enabled and it's a small 'toy' dataset that fits comfortably in memory, you could just enumerate it into a new list and grab the last index (then add 1 because lists are zero-indexed):

dataset_length = [i for i,_ in enumerate(dataset)][-1] + 1

Of course this is inefficient at best and, for large datasets, will fail entirely because everything needs to fit into memory for the list. In such circumstances I can't see any alternative other than to iterate through the records, keeping a manual count.
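The manual count can be sketched in two ways that avoid building a list, assuming TensorFlow 2.x with eager execution. The filtered Dataset.range pipeline is a stand-in for a real dataset of unknown length:

```python
import tensorflow as tf

# Stand-in dataset whose cardinality is unknown because of filter().
dataset = tf.data.Dataset.range(100).filter(lambda x: x % 3 == 0)

# Option A: plain Python loop, keeps only a single integer in memory.
count = 0
for _ in dataset:
    count += 1

# Option B: let tf.data do the counting with reduce().
# The accumulator starts at 0 and is incremented once per element.
count_reduce = int(
    dataset.reduce(tf.constant(0, tf.int64), lambda acc, _: acc + 1).numpy()
)

print(count, count_reduce)
```

Both options still make a full pass over the data, so the cost is proportional to the dataset size; only the memory use is constant.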

Solution 2:[2]

Convert the dataset to a list of NumPy arrays and take the shape of that list:

dataset_to_numpy = list(dataset.as_numpy_iterator())
shape = tf.shape(dataset_to_numpy)
print(shape)

It produces output like this:

tf.Tensor([1080   64   64    3], shape=(4,), dtype=int32)

The code is simple to write, but it still takes time to iterate over the whole dataset. For more information, see the tf.data.Dataset documentation.

Solution 3:[3]

As of 4/15/2022, with TF v2.8, you can get the result using:

dataset.cardinality().numpy()

ref: https://www.tensorflow.org/api_docs/python/tf/data/Dataset#cardinality
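As a rough sketch, cardinality() can be combined with a counting fallback for datasets whose length is not statically known. The helper name dataset_length is hypothetical, and the code assumes a TensorFlow 2.x release that exposes tf.data.INFINITE_CARDINALITY and tf.data.UNKNOWN_CARDINALITY (TF v2.8 does):

```python
import tensorflow as tf


def dataset_length(dataset):
    """Hypothetical helper: number of elements in a finite dataset."""
    n = int(dataset.cardinality().numpy())
    if n == tf.data.INFINITE_CARDINALITY:
        # Counting an endless dataset would never terminate.
        raise ValueError("dataset is infinite")
    if n == tf.data.UNKNOWN_CARDINALITY:
        # Fall back to a full pass that counts the elements.
        n = int(
            dataset.reduce(tf.constant(0, tf.int64), lambda acc, _: acc + 1).numpy()
        )
    return n


print(dataset_length(tf.data.Dataset.range(42)))
print(dataset_length(tf.data.Dataset.range(42).filter(lambda x: x < 5)))
```

The fast path costs nothing; the fallback is only taken when a transformation such as filter has hidden the length.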

Solution 4:[4]

To see element shapes and types, print dataset elements directly instead of using as_numpy_iterator (see https://www.tensorflow.org/api_docs/python/tf/data/Dataset):

dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])
for element in dataset:
  print(element)

Break out of the for loop after the first element to see the shape of a single element:

dataset = tf.data.Dataset.from_tensor_slices((X_s, y_s))
for element in dataset:
  print(element)
  break

The output is a tuple of two tensors, each printed together with its shape:

(<tf.Tensor: shape=(13,), dtype=float32, numpy=
array([ 0.9521966 ,  0.68100524,  1.973123  ,  0.7639558 , -0.2563337 ,
        2.394438  , -1.0058318 ,  0.01544279, -0.69663054,  1.0873381 ,
       -2.2745786 , -0.71442884, -2.1488726 ], dtype=float32)>, <tf.Tensor: shape=(2,), dtype=float32, numpy=array([0., 1.], dtype=float32)>)
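If only the shapes and types are needed, they can also be read without iterating at all, via the dataset's element_spec property. This is a minimal sketch assuming TensorFlow 2.x; the random X_s and y_s arrays are made-up placeholders chosen to match the (13,) and (2,) shapes in the output above:

```python
import numpy as np
import tensorflow as tf

# Placeholder data: 5 samples, 13 features, one-hot labels over 2 classes.
X_s = np.random.rand(5, 13).astype(np.float32)
y_s = np.eye(2, dtype=np.float32)[np.random.randint(0, 2, size=5)]

dataset = tf.data.Dataset.from_tensor_slices((X_s, y_s))

# element_spec mirrors the element structure: one TensorSpec per component.
features_spec, labels_spec = dataset.element_spec
print(features_spec.shape, features_spec.dtype)
print(labels_spec.shape, labels_spec.dtype)

# After batching, the unknown batch dimension shows up as None,
# which is what the "?" entries in a printed dataset correspond to.
batched = dataset.batch(2)
print(batched.element_spec[0].shape)
```

Note that element_spec describes the shape of one element, not the number of elements; the total count still has to come from cardinality or from iterating.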

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Stewart_R
Solution 2 Shivam Roy
Solution 3 Vincent Yuan
Solution 4 Alex Punnen