Feature crosses and embeddings with TensorFlow Transform
I am trying to understand TensorFlow Transform by re-implementing this using only tft functions. I am able to bucketize continuous features using something like
tft.apply_buckets(transformed['normalized_pickup_longitude'], bucket_boundaries=tf.constant([lat_lon_buckets]))
The next step would be to create a feature cross that represents the origin of the trip, but I'm not finding any tft functions that do this. Do I have to create a Cartesian product by multiplying these buckets, and then one-hot encode them to create these feature crosses? And what do I do when it comes time to create an embedding to try and learn lower-dimensional representations of these during training?
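For concreteness, this is what I mean by multiplying the buckets into a Cartesian product. A minimal sketch with placeholder boundaries and hypothetical feature names (the real boundaries and names depend on the dataset):

```python
import tensorflow as tf
import tensorflow_transform as tft

NBUCKETS = 4  # assumed buckets per axis; boundaries below are placeholders
LAT_BOUNDARIES = [41.85, 41.88, 41.92]     # NBUCKETS - 1 boundaries
LON_BOUNDARIES = [-87.70, -87.65, -87.60]

def preprocessing_fn(inputs):
    # Bucketize each coordinate with precomputed boundaries, as above.
    lat_bucket = tft.apply_buckets(
        inputs['pickup_latitude'], tf.constant([LAT_BOUNDARIES]))
    lon_bucket = tft.apply_buckets(
        inputs['pickup_longitude'], tf.constant([LON_BOUNDARIES]))
    # Cartesian-product cross: one integer id per (lat, lon) grid cell.
    # This arithmetic is stateless, so no tft analyzer is involved.
    return {'pickup_origin': lat_bucket * NBUCKETS + lon_bucket}
```

This produces a single integer id per cell rather than a one-hot vector, which leaves the question of where the one-hot/embedding step should happen.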
Is the idea that we don't do this in tft because the paradigm is to preprocess with tft only up to a point: store the features to disk without any sparsity, and wherever sparsity arises, create it at training time as part of a preprocessing layer in the model, so that it is handled in memory?
Thank you!
Update May 19th, 2022: Through some headbanging, I have come up with a heuristic which may be useful. In the preprocessing_fn that gets passed to AnalyzeAndTransformDataset, you can only perform stateless operations; anything stateful has to be called via tft functions. For example, you cannot use tf.keras.layers.Normalization, because it sets variables to persist the mean and variance of the data, but you can use tf.keras.layers.experimental.preprocessing.HashedCrossing, because the hashing algorithm doesn't maintain any state. If you follow these rules while using tft, you get two benefits (see the sketch after this list):
- A reproducible pipeline (it persists a reusable transformation layer)
- Management of schema-related artifacts (it persists both the original and the transformed schema)
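Here is a minimal sketch of that stateless-vs-stateful rule in a preprocessing_fn; the feature names are hypothetical and the specific ops are just illustrations of each category:

```python
import tensorflow as tf
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    outputs = {}

    # Stateful: mean/variance require a full pass over the data, so this
    # must be the tft analyzer, not tf.keras.layers.Normalization.
    outputs['fare_scaled'] = tft.scale_to_z_score(inputs['fare'])

    # Stateful: building a vocabulary also needs a full pass, hence tft.
    outputs['payment_type_id'] = tft.compute_and_apply_vocabulary(
        inputs['payment_type'])

    # Stateless: hashing keeps no learned state, so a Keras preprocessing
    # layer (or a plain TF op) is fine here, assuming dense scalar inputs.
    outputs['company_x_area'] = tf.keras.layers.experimental.preprocessing.HashedCrossing(
        num_bins=1000)((inputs['company'], inputs['dropoff_area']))

    return outputs
```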
I'm not sure which of these two benefits is greater. With this, for example, you can write functions that abstract the schema away from you, so you can focus on feature engineering at a very granular level, which is huge in my opinion. My setup is now a tft pipeline that produces the inputs to the model. In the model, the trainable layers take in the features that are fully engineered by that point, while the remaining features go through Keras preprocessing layers (for example, to avoid storing sparse tensors on disk) before they integrate into the rest of the model. The model-side half is sketched below.
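A minimal sketch of that model-side half, assuming the tft pipeline emits bucketized integer indices; the input names, bucket counts, and embedding width are hypothetical:

```python
import tensorflow as tf

NBUCKETS = 10                    # assumed buckets per lat/lon axis
HASH_BINS = NBUCKETS * NBUCKETS  # roughly one bin per grid cell

# Bucketized indices arrive from the tft pipeline as integer features.
lat = tf.keras.Input(shape=(1,), dtype=tf.int64, name='pickup_lat_bucket')
lon = tf.keras.Input(shape=(1,), dtype=tf.int64, name='pickup_lon_bucket')

# Stateless cross created at training time, so the sparse/one-hot
# expansion never touches disk.
crossed = tf.keras.layers.experimental.preprocessing.HashedCrossing(
    num_bins=HASH_BINS)((lat, lon))

# The embedding is trainable state, learned with the rest of the model.
embedded = tf.keras.layers.Embedding(
    input_dim=HASH_BINS, output_dim=8)(crossed)
output = tf.keras.layers.Dense(1)(tf.keras.layers.Flatten()(embedded))

model = tf.keras.Model(inputs=[lat, lon], outputs=output)
```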
In any case, I hope someone more knowledgeable will shed some light on the rules of tft and on whether I'm guessing correctly here.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
