How to use a custom function with the TensorFlow Dataset API?
I am new to TensorFlow's tf.data.Dataset and I am trying to use it on data that I loaded into a pandas DataFrame, as follows.
Load the input data (df_input):
id messages Label
0 11 I am not driving home 0
1 11 Please pick me up 1
2 103 The car already park 1
3 103 No need for ticket 0
4 104 I will buy a car 1
5 104 I will buy truck 1
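For reproducibility, df_input can be rebuilt from the printout above (a minimal reconstruction; the id column is not used by the snippets below):
import pandas as pd

df_input = pd.DataFrame({
    "id": [11, 11, 103, 103, 104, 104],
    "messages": ["I am not driving home", "Please pick me up",
                 "The car already park", "No need for ticket",
                 "I will buy a car", "I will buy truck"],
    "Label": [0, 1, 1, 0, 1, 1],
})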
Then I preprocess the text and apply TextVectorization as follows:
import tensorflow as tf
from tensorflow.keras import layers

text_vectorizer = layers.TextVectorization(max_tokens=20, output_mode="int", output_sequence_length=6)
text_vectorizer.adapt(df_input.messages.values.tolist())

def encode(texts):
    encoded_texts = text_vectorizer(texts)
    return encoded_texts.numpy()

train_data = encode(df_input.messages.values)  # This is the training data.
train_label = tf.keras.utils.to_categorical(df_input.Label.values, 2)  # These are the labels.
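With the six-row frame above, the shapes come out as follows (a sanity check added for reference; the exact token values depend on the adapted vocabulary):
print(train_data.shape)   # (6, 6): six messages, each padded/truncated to length 6
print(train_label.shape)  # (6, 2): one-hot labels for two classes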
Then I feed the preprocessed data into the training model using tf.data.Dataset, as follows:
train_dataset_df = (
    tf.data.Dataset.from_tensor_slices((train_data, train_label))
    .shuffle(1000)
    .batch(2)
)
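For reference, iterating one batch shows the shapes the pipeline produces (a quick sanity check, not part of the original post):
for x_batch, y_batch in train_dataset_df.take(1):
    print(x_batch.shape, y_batch.shape)  # (2, 6) (2, 2)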
My question is: how can I transform the data in every training epoch by applying my custom function to the training data? I saw a usage example that performs the transformation via the .map function in this post:
train_dataset = train_dataset.batch(2).map(lambda x, y: (text_vectorizer(x), y))
My goal is to apply my custom function as follows (it reorders the words in the text data):
def order_augment_sent(sentence):
    words = sentence.split(" ")
    words.sort()
    new_sentence = " ".join(words)
    return new_sentence
train_dataset_ds = (
    tf.data.Dataset.from_tensor_slices((train_data, train_label))
    .shuffle(1000)
    .batch(2)
    .map(lambda x, y: (order_augment_sent(x), y))
)
But I get this error:
AttributeError: 'Tensor' object has no attribute 'split'
Or, if I apply my other custom function, I get:
TypeError: To be compatible with tf.function, Python functions must return zero or more Tensors or ExtensionTypes or None values; in compilation of <function _tf_if_stmt.<locals>.aug_body at 0124f565>, found return value of type WarningException, which is not a Tensor or ExtensionType.
I am not sure how to do this, and I would appreciate any idea or solution.
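For context, the first error can be reproduced in isolation: inside Dataset.map, each element is a symbolic tf.Tensor, so Python string methods such as .split() are not available on it (a minimal sketch, not from the original post):
import tensorflow as tf

ds = tf.data.Dataset.from_tensor_slices(["please pick me up"])
try:
    # Fails while tracing the map function: a tf.Tensor has no .split().
    ds.map(lambda x: x.split(" "))
except AttributeError as err:
    print(err)  # 'Tensor' object has no attribute 'split'

# Tensor-aware string ops live in tf.strings instead:
tokens_ds = ds.map(lambda x: tf.strings.split(x, " "))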
Solution 1:[1]
The parameters your lambda function receives are already the integer tokens produced by the vectorizer, so they are ints. If you want to reorder the text data, you need to do it before the TextVectorization step.
So you should move the TextVectorization layer into your model; then your map function receives the raw strings, and you can reorder each sentence before TextVectorization is applied.
Here is an almost working example; you just need to edit the order_augment_sent function with the code you need. I didn't know what kind of sorting you want to do; you will probably have to write a custom sort with NumPy inside a wrapper such as tf.py_function: https://www.tensorflow.org/api_docs/python/tf/py_function
import tensorflow as tf
import numpy as np

train_data = ["I am not driving home", "Please pick me up", "The car already park",
              "No need for ticket", "I will buy a car", "I will buy truck"]
train_label = [0, 1, 1, 0, 1, 1]

text_dataset = tf.data.Dataset.from_tensor_slices(train_data)

max_features = 5000  # Maximum vocab size.
max_len = 4  # Sequence length to pad the outputs to.

# Create the layer.
vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=max_features,
    output_mode='int',
    output_sequence_length=max_len)

# Now that the vocab layer has been created, call `adapt` on the text-only
# dataset to create the vocabulary. You don't have to batch, but for large
# datasets this means we're not keeping spare copies of the dataset.
vectorize_layer.adapt(train_data)

# Create the model that uses the vectorization layer.
model = tf.keras.models.Sequential()

# Start by creating an explicit input layer. It needs to have a shape of
# (1,) (because we need to guarantee that there is exactly one string
# input per batch), and the dtype needs to be 'string'.
model.add(tf.keras.Input(shape=(1,), dtype=tf.string))

# The first layer in our model is the vectorization layer. After this
# layer, we have a tensor of shape (batch_size, max_len) containing vocab
# indices.
model.add(vectorize_layer)

def apply_order_augment_sent(s):
    # s arrives as bytes from tf.numpy_function, so decode it first.
    sentence = s.decode('utf-8')
    words = sentence.split(" ")
    words.sort()
    return " ".join(words)

def order_augment_sent(x: np.ndarray, y: np.ndarray):
    # Reorder each sentence in the batch; x is a NumPy array of byte strings.
    new_x = []
    for i in range(len(x)):
        new_x.append(np.array([apply_order_augment_sent(x[i])]))
    return new_x, y

train_dataset_ds = tf.data.Dataset.from_tensor_slices((train_data, train_label))
train_dataset_ds = train_dataset_ds.shuffle(1000).batch(32)
# Wrap the NumPy-based augmentation so it can run inside the tf.data graph.
train_dataset_ds = train_dataset_ds.map(lambda item1, item2: tf.numpy_function(
    order_augment_sent, [item1, item2], [tf.string, tf.int32]))

list(train_dataset_ds.as_numpy_iterator())
model.predict(train_dataset_ds)
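To actually train on this pipeline, the (string, label) batches can be fed straight to a model that owns the TextVectorization layer. A minimal sketch follows (not part of the original answer; the classification head, the optimizer, and the shape-restoring map are assumptions):
# tf.numpy_function erases static shape information, so restore it before
# handing the dataset to Keras.
def restore_shapes(x, y):
    x.set_shape([None, 1])  # batch of single-string samples
    y.set_shape([None])     # batch of integer labels
    return x, y

train_dataset_ds = train_dataset_ds.map(restore_shapes)

# Hypothetical classification head on top of the vectorizer already in `model`.
model.add(tf.keras.layers.Embedding(max_features + 1, 16))
model.add(tf.keras.layers.GlobalAveragePooling1D())
model.add(tf.keras.layers.Dense(1, activation="sigmoid"))

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(train_dataset_ds, epochs=3)
Because the vectorizer lives inside the model, the augmentation in the dataset map sees raw strings, which is exactly what the reordering function needs.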
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow