One-hot encoding TensorFlow strings

I have a list of strings as labels for training a neural network. Now I want to convert them via one-hot encoding so that I can use them in my TensorFlow network. My input list looks like this:

 labels = ['"car"', '"pedestrian"', '"car"', '"truck"', '"car"']

The requested outcome should be something like

 one_hot = [0, 1, 0, 2, 0]

What is the easiest way to do this? Any help would be much appreciated.

Cheers, Andi



Solution 1:[1]

The desired outcome looks like sklearn's LabelEncoder, not like OneHotEncoder. In TensorFlow the closest analogue is CategoryEncoding, BUT note that it is a preprocessing layer which encodes integer features:
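For comparison, here is a minimal sketch of the sklearn route (assuming scikit-learn is available), which produces exactly the integer codes asked for:

```python
from sklearn.preprocessing import LabelEncoder

labels = ['"car"', '"pedestrian"', '"car"', '"truck"', '"car"']

# LabelEncoder assigns integer codes by sorting the unique strings,
# so '"car"' -> 0, '"pedestrian"' -> 1, '"truck"' -> 2.
encoder = LabelEncoder()
codes = encoder.fit_transform(labels)
print(codes)  # [0 1 0 2 0]
```

This is usually the simplest answer when all you need is the integer labels themselves rather than a layer inside the model.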

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

inp = layers.Input(shape=[X.shape[0]])  # X: the integer-encoded label array
x0 = layers.CategoryEncoding(
          num_tokens=3, output_mode="multi_hot")(inp)

model = keras.Model(inputs=[inp], outputs=[x0])
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=[tf.keras.metrics.CategoricalCrossentropy()])
model.summary()
  • This part produces the encoding of the unique values. You can add another branch to this model that takes your initial vector as input and maps it to the labels from this reference branch (much like joining a reference table with a fact table in a database) -- the result is an ensemble of the reference data and the data you actually need.

Pay attention that num_tokens=3 and output_mode="multi_hot" are given explicitly, AND the integer codes for the class names must be known before the model is used -- this is a feature-engineering step, like this (with a pd.DataFrame):

import numpy as np
import pandas as pd

d = {'transport_col': ['"car"', '"pedestrian"', '"car"', '"truck"', '"car"']}
dataset_df = pd.DataFrame(data=d)

classes = dataset_df['transport_col'].unique().tolist()
print(f"Label classes: {classes}")

df = dataset_df['transport_col'].map(classes.index).copy()
print(df)

From the referenced manual example: encode the categorical label as an integer. This stage is necessary if your classification label is represented as a string; note that Keras expects classification labels to be integers.
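If you later need actual one-hot vectors (e.g. for categorical_crossentropy targets), the integer codes can be expanded with plain NumPy -- a sketch of what tf.one_hot does in-graph:

```python
import numpy as np

codes = np.array([0, 1, 0, 2, 0])  # integer labels from the mapping above
num_classes = 3

# Row i of the identity matrix is the one-hot vector for class i,
# so fancy-indexing by the codes yields one row per label.
one_hot = np.eye(num_classes, dtype=int)[codes]
print(one_hot)
# [[1 0 0]
#  [0 1 0]
#  [1 0 0]
#  [0 0 1]
#  [1 0 0]]
```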

Solution 2:[2]

In another architecture you could perhaps use StringLookup:

import numpy as np
import tensorflow as tf

labels = np.array(['"car"', '"pedestrian"', '"car"', '"truck"', '"car"'])
vocab = np.unique(labels)
inp = tf.keras.Input(shape=labels.shape[0], dtype=tf.string)
x = tf.keras.layers.StringLookup(vocabulary=vocab)(inp)

But labels are usually dependent variables, as opposed to features, and shouldn't be fed to an Input layer.
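What StringLookup does to each token can be sketched in plain Python (assuming the default num_oov_indices=1): index 0 is reserved for out-of-vocabulary tokens, so known vocabulary entries start at 1, which is why the outputs below get shifted down by one:

```python
vocab = ['"car"', '"pedestrian"', '"truck"']  # sorted unique labels
labels = ['"car"', '"pedestrian"', '"car"', '"truck"', '"car"']

# Each vocabulary entry maps to its position + 1; unknown tokens map to 0 (OOV).
lookup = {token: i + 1 for i, token in enumerate(vocab)}
indices = [lookup.get(token, 0) for token in labels]
print(indices)                    # [1, 2, 1, 3, 1]
print([i - 1 for i in indices])   # [0, 1, 0, 2, 0] -- shifted to start at 0
```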

Everything is in the Keras docs.

Possible full code:

import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

X = np.array([['"car"', '"pedestrian"', '"car"', '"truck"', '"car"']])
vocab = np.unique(X)
print(vocab)

y = np.array([[0, 1, 0, 2, 0]])

inp = layers.Input(shape=[X.shape[1]], dtype='string')
x0 = tf.keras.layers.StringLookup(vocabulary=vocab, name='finish')(inp)

model = keras.Model(inputs=[inp], outputs=[x0])
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=[tf.keras.metrics.CategoricalCrossentropy()])
model.summary()

from tensorflow.keras import backend as K

for layerIndex, layer in enumerate(model.layers):
    print(layerIndex)
    func = K.function([model.get_layer(index=0).input], layer.output)
    layerOutput = func([X])  # X is the numpy array of strings
    print(layerOutput)
    if layerIndex == 1:          # the StringLookup layer, the last one here
        scale = lambda x: x - 1  # indices start at 1 (0 is reserved for OOV)
        print(scale(layerOutput))

res:

[[0 1 0 2 0]]

Solution 3:[3]

Another possible solution for your case is layers.TextVectorization:

import numpy as np
import keras
from tensorflow.keras import layers

input_array = np.atleast_2d(np.array(['"car"', '"pedestrian"', '"car"', '"truck"', '"car"']))
vocab = np.unique(input_array)

input_data = keras.Input(shape=(None,), dtype='string')
layer = layers.TextVectorization(max_tokens=None, standardize=None, split=None,
                                 output_mode="int", vocabulary=vocab)

int_data = layer(input_data)
model = keras.Model(inputs=input_data, outputs=int_data)

output_dataset = model.predict(input_array)
print(output_dataset)  # indices start at 2: 0 is reserved for padding, 1 for OOV

scale = lambda x: x - 2
print(scale(output_dataset))

result:

array([[0, 1, 0, 2, 0]])
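The "starts from 2" behaviour above can be sketched in plain Python (assuming TextVectorization's defaults): with output_mode="int" index 0 is reserved for padding and index 1 for out-of-vocabulary tokens, so vocabulary entries begin at 2, hence the shift by two:

```python
vocab = ['"car"', '"pedestrian"', '"truck"']  # sorted unique tokens
labels = ['"car"', '"pedestrian"', '"car"', '"truck"', '"car"']

# Indices 0 (padding) and 1 (OOV) are reserved, so each vocabulary
# entry maps to its position + 2; unknown tokens map to 1.
lookup = {token: i + 2 for i, token in enumerate(vocab)}
indices = [lookup.get(token, 1) for token in labels]
print(indices)                    # [2, 3, 2, 4, 2]
print([i - 2 for i in indices])   # [0, 1, 0, 2, 0]
```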

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2
Solution 3