How to persist patsy DesignInfo?

I'm working on an application that is a "predictive-model-as-a-service", structured as follows:

  • train a model offline
  • periodically upload model parameters to a "prediction server"
  • the prediction server takes as input a single observation, and outputs a prediction

I'm trying to use patsy, but I'm running into the following problem: when a single prediction request comes in, how do I convert it to the right shape so that it looks like a row of the training data?

The patsy documentation provides an example when the DesignInfo from the training data is available in memory: http://patsy.readthedocs.io/en/latest/library-developers.html#predictions

# offline model training
import patsy

data = {'animal': ['cat', 'cat', 'dog', 'raccoon'], 'cuteness': [3, 6, 10, 4]}
eq_string = "cuteness ~ animal"


dmats = patsy.dmatrices(eq_string, data)
design_info = dmats[1].design_info
train_model(dmats)


# online predictions
input_data = {'animal': ['raccoon']}

# if the DesignInfo were available, I could do this:
new_dmat = patsy.build_design_matrices([design_info], input_data)
make_prediction(new_dmat, trained_model)

And then the output:

[DesignMatrix with shape (1, 3)
   Intercept  animal[T.dog]  animal[T.raccoon]
           1              0                  1
   Terms:
     'Intercept' (column 0)
     'animal' (columns 1:3)]

Notice that this row has the same shape as the training data; it has a column for animal[T.dog]. In my application, I don't have a way to access the DesignInfo to build the DesignMatrix for the new data. Concretely, how would the prediction server know how many other categories of animal exist in the training data, and in what order?

I thought I could just pickle it, but it turns out this isn't supported yet: https://github.com/pydata/patsy/issues/26

I could also simply persist the matrix column names as a string and rebuild the matrix from them online, but this seems a bit fragile.
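
Roughly, what I mean by that fallback is something like the sketch below (the file name columns.json and the manual one-hot encoding are just illustrative):

import json
import numpy as np
import patsy

# offline: persist the column names produced by patsy
data = {'animal': ['cat', 'cat', 'dog', 'raccoon'], 'cuteness': [3, 6, 10, 4]}
dmats = patsy.dmatrices("cuteness ~ animal", data)
with open("columns.json", "w") as f:
    json.dump(dmats[1].design_info.column_names, f)

# online: rebuild a single row by matching categorical levels to column names
with open("columns.json") as f:
    columns = json.load(f)  # ['Intercept', 'animal[T.dog]', 'animal[T.raccoon]']

input_data = {'animal': ['raccoon']}
row = np.zeros(len(columns))
row[columns.index('Intercept')] = 1.0
col = "animal[T.{}]".format(input_data['animal'][0])
if col in columns:  # the base level ('cat') has no column
    row[columns.index(col)] = 1.0

This reproduces the [1, 0, 1] row above, but it silently depends on patsy's column-naming scheme, which is why it feels fragile.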

Is there a good way to do this?



Solution 1:[1]

Assuming your goal is to be able to restart the server without retraining, your best option (until patsy implements pickling) is probably to pickle data, eq_string, and whatever parameters are calculated by train_model. Upon restarting the server, unpickle data and eq_string and call dmats = patsy.dmatrices(eq_string, data) again. This should run quickly, since it isn't training a model, just preprocessing your data. Then unpickle the parameters calculated by train_model (not shown in the question), and the server is ready to make predictions for new inputs.
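
For concreteness, here is a rough sketch of that flow, reusing the placeholder functions train_model and make_prediction from your question (the file name model_state.pkl is arbitrary, and I'm assuming train_model returns the fitted parameters):

import pickle
import patsy

# --- offline training ---
data = {'animal': ['cat', 'cat', 'dog', 'raccoon'], 'cuteness': [3, 6, 10, 4]}
eq_string = "cuteness ~ animal"
dmats = patsy.dmatrices(eq_string, data)
params = train_model(dmats)  # assuming this returns the fitted parameters

with open("model_state.pkl", "wb") as f:
    pickle.dump({'data': data, 'eq_string': eq_string, 'params': params}, f)

# --- prediction server startup ---
with open("model_state.pkl", "rb") as f:
    state = pickle.load(f)

# re-run the (cheap) preprocessing step to recover the DesignInfo
dmats = patsy.dmatrices(state['eq_string'], state['data'])
design_info = dmats[1].design_info

# --- per-request prediction ---
input_data = {'animal': ['raccoon']}
new_dmat = patsy.build_design_matrices([design_info], input_data)
prediction = make_prediction(new_dmat, state['params'])

The only things persisted are the raw training data, the formula string, and the fitted parameters; the DesignInfo is recomputed cheaply at startup.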

Note that if you split this into client and server components, the server should do everything discussed above, and the client should just send it the input_data defined in your question. (The client never needs to see dmats or design_info.)

Solution 2:[2]

Is there any progress on this issue? It's something that is very much needed.

The GitHub issue is still open.

Perhaps something simple like this would work?

import h5py

def save_patsy(patsy_step, filename):
    """Save the DesignInfo of a fitted patsy transformer into a .h5 file."""
    with h5py.File(filename, 'w') as hf:
        hf.create_dataset("design_info", data=patsy_step.design_info_)

def load_patsy(patsy_step, filename):
    """Attach a saved DesignInfo to a patsy transformer."""
    with h5py.File(filename, 'r') as hf:
        design_info = hf['design_info'][:]
    patsy_step.design_info_ = design_info


# 'pipe' is a scikit-learn Pipeline with a patsy transformer step named 'patsy'
save_patsy(pipe['patsy'], "clf.h5")

However, this still doesn't work, presumably because h5py's create_dataset expects array-like data and a DesignInfo object isn't an array. But I think this is a first step.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Matthias Fripp
Solution 2 Petr