Proper way to add additional data to a tf.dataset using existing element information

I want to train a model that has two types of data, a multivariate time series part and a static (as in not changing over time) part corresponding to which of some 8,000 different sites the time series part comes from.

To create windows on the time series part, I am using the timeseries_dataset_from_array util:

train_ds = tf.keras.preprocessing.timeseries_dataset_from_array(
    input_data, targets, sequence_length=window_size, batch_size=256)

Now, within that data there is a label for the site in some column. What I want to do is use this site label to retrieve the static covariates specific to that site, which are simply in a Pandas dataframe small enough to fit in memory. I want to use the static covariates as a second input fed into a different part of the model, as the time series data and the static covariates are different shapes.

My current idea is to use dataset.map with a function that retrieves the appropriate data from the dataframe, using the site label as an index (the site label is numerical and the order of rows in the dataframe matches, so I simply need to use the site label as the row index), and returns it as a second input. However, I have some questions and challenges:

First, how do I extract the numerical value of the site label from each dataset element so that I can grab the appropriate row? This is tricky because tf tensors cannot be used like primitive types. Unless it is wrapped in something like tf.py_function, a map function is automatically converted to a graph by AutoGraph, so you can't call .eval() or .numpy() on the tensors inside it.

Second, related to the first, will I have to wrap my map function in tf.py_function in order to retrieve the values from the dataframe and add them to the data element being processed? If I can avoid tf.py_function, presumably I will gain some performance.

To sum up: I want to create a map function on a tf.dataset that looks at a value in the current dataset element in order to index a pandas dataframe and add additional inputs to the element. The challenge is that you can't just read the value of a TensorFlow tensor (outside of eager mode) however you please. So, how can I do this?
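For reference, the tf.py_function route mentioned above could look roughly like the sketch below. The dataframe contents, the column position of the site label, and the names static_df, lookup_static and add_static are all made up for illustration. The Python lookup runs eagerly, so .numpy() works, but the shape of the result has to be restored by hand because tf.py_function does not infer it.

```python
import numpy as np
import pandas as pd
import tensorflow as tf

# Hypothetical static covariates: row i belongs to site i.
static_df = pd.DataFrame({"X1": [10.0, 20.0], "X2": [0.5, 0.7]})

def lookup_static(site_id):
    # Executed eagerly by tf.py_function, so .numpy() is available here.
    return static_df.iloc[int(site_id.numpy())].to_numpy(dtype=np.float32)

def add_static(window):
    # Assumption: the site label is column 1; read it from the first time step.
    site_id = window[0, 1]
    static = tf.py_function(lookup_static, inp=[site_id], Tout=tf.float32)
    # tf.py_function drops static shape information, so set it explicitly.
    static.set_shape([static_df.shape[1]])
    return window, static

# Two hand-written windows of shape (window_size=2, n_features=2).
windows = tf.constant([[[1.0, 0.0], [2.0, 0.0]],
                       [[3.0, 1.0], [4.0, 1.0]]])
ds = tf.data.Dataset.from_tensor_slices(windows).map(add_static)
```

This works, but every element makes a round trip through the Python interpreter, which is the overhead the question is hoping to avoid.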

EDIT: Sample data to illustrate the problem.

So, suppose I have static covariates which apply to each of several sites:

"X1","X2","X3","X4","X5","X6","X7","X8"
4.32133447741287e-06,6.20476163876063e-06,0.0212094904785247,0.370881311305713,0.370689916185094,0.083173334124699,0.229070897580682,0.719638056520948
0,0,0.00370161158332356,0.339504724358077,0.344646384393516,0.0517004067054517,0.759331739735653,0.297462461477932
0.00041030758990847,0.000628192603260548,0.0192545155969968,0.529130656702492,0.537133867461067,0.199873252245348,0.333683574561935,0.875549144318417

Where each row is a different site (e.g. sites 0, 1, and 2 here)

Simultaneously, I have a multivariate time series that looks like this:

,Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,id,labels
0,1.0,0.94482875,0.46432582,0.80286896,0.7936321,0.67850864,0.549642,0.31848758,0,3
1,2.0,0.9560909,0.4811706,0.79488665,0.79743224,0.7222551,0.57671803,0.31862718,0,3
2,3.0,0.93590724,0.48287416,0.7835187,0.7924177,0.71277523,0.6097479,0.32546812,0,3
3,4.0,0.92497957,0.44431886,0.7871566,0.79615116,0.74299884,0.59539235,0.310656,0,3
4,5.0,0.9314811,0.4567894,0.81035763,0.78996235,0.75572145,0.6089085,0.31769997,0,3
5,6.0,0.9362454,0.45195797,0.80348736,0.8113547,0.7565665,0.5617635,0.30980688,0,3
6,7.0,0.94843304,0.4684674,0.80212045,0.82646096,0.77068603,0.5671833,0.2966195,0,3

From which we draw the windows of inputs using the timeseries_dataset_from_array util. Each window looks like just some number of the rows from the multivariate time series above.

The second-to-last column is the site ID. For each input in the dataset generated by timeseries_dataset_from_array, I want to grab the site ID in the first row (I already filter out any windows that cross sites), use that number to index into the static covariates at the appropriate row (e.g., row 0 here), and attach that row to the dataset element's features as a second, distinct input to the model.

I know how to do this using ordinary Python code, which I can execute using the tf.py_function wrapper. What I am wondering is whether it is possible to perform this operation using purely TensorFlow functions so that I do not have to use this wrapper. Without the wrapper, the code doesn't run in eager mode, which means that not all Python code works properly. In particular, you can't evaluate a tensor using the usual .eval() or .numpy(), which is the main part I'm having trouble with: you can't directly use a tensor object as an index into a dataframe. I'm thinking that using tf.gather after converting the df to a tensor may work, but I am unsure.
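The tf.gather idea can be sketched as follows, as a rough illustration rather than a confirmed solution. The data, the SITE_COL position, and the add_static name are assumptions. Using sequence_stride equal to the window size is only a stand-in for the cross-site filtering described above. The static table is converted to a tensor once, and the lookup then stays inside the graph.

```python
import numpy as np
import tensorflow as tf

# Hypothetical static covariates as a tensor: row i belongs to site i.
# With a real dataframe this could be tf.constant(df.to_numpy(), tf.float32).
static_table = tf.constant([[10.0, 0.1], [20.0, 0.2], [30.0, 0.3]],
                           dtype=tf.float32)

# Hypothetical time series: columns are (value, site_id), 4 rows per site.
values = np.arange(12, dtype=np.float32)
site_ids = np.repeat([0, 1, 2], 4).astype(np.float32)
input_data = np.stack([values, site_ids], axis=1)
targets = values  # placeholder targets, one per row

SITE_COL = 1      # assumed position of the site-label column
window_size = 4

# Stride == window size so that no window crosses a site boundary
# (a stand-in for the filtering step mentioned in the question).
ds = tf.keras.preprocessing.timeseries_dataset_from_array(
    input_data, targets, sequence_length=window_size,
    sequence_stride=window_size, batch_size=2)

def add_static(window, target):
    # window has shape (batch, window_size, n_features).
    # Take the site label from the first time step of every window.
    ids = tf.cast(window[:, 0, SITE_COL], tf.int32)
    static = tf.gather(static_table, ids)  # shape (batch, n_static)
    return (window, static), target

ds_with_static = ds.map(add_static, num_parallel_calls=tf.data.AUTOTUNE)
```

Each element then has the form ((window, static), target), which matches what a two-input Keras model expects. No tensor value is ever read in Python, so tf.py_function is not needed.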

EDIT EDIT:

To clarify further,

If the time series is like this:

data, sitelabel
1, 0
2, 0
3, 1
4, 1

And the static covariates for each site are:

sitelabel, static covariate
0, 10
1, 20

Then, when timeseries_dataset_from_array processes the time series with a window size of 2 (after I filter cross-site windows out), I get the following two input elements:

[[1, 0], [2, 0]]

[[3, 1], [4, 1]]

What I wish to do is grab the associated static covariates of the site label, and have that be a separate input, e.g. I want to have a dataset returned that looks like:

([[1, 0], [2, 0]], [10])

([[3, 1], [4, 1]], [20])

Or as a dictionary with the time sequence and the static covariates as the two members.
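A minimal sketch of this toy case, assuming the two windows have already been produced and filtered, and returning the dictionary form. The names windows, static_table and attach_static are illustrative only.

```python
import tensorflow as tf

# The two already-filtered windows from the toy example: (data, sitelabel) rows.
windows = tf.constant([[[1, 0], [2, 0]],
                       [[3, 1], [4, 1]]], dtype=tf.int32)

# Static covariates: row i holds the covariate(s) for site i.
static_table = tf.constant([[10], [20]], dtype=tf.int32)

def attach_static(window):
    # Site label of the first row in the window (column 1).
    site_id = window[0, 1]
    return {"series": window, "static": tf.gather(static_table, site_id)}

ds = tf.data.Dataset.from_tensor_slices(windows).map(attach_static)

for element in ds:
    print(element["series"].numpy().tolist(), element["static"].numpy().tolist())
# [[1, 0], [2, 0]] [10]
# [[3, 1], [4, 1]] [20]
```

The keys "series" and "static" would need to match the names of the model's input layers if the dataset is passed straight to a Keras model.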



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow