I didn't quite understand the need for Openscoring and PMML. If it's just a question of using a sklearn model to predict an outcome, why not just build it into a simple JSON-RPC service with Tornado, Gevent or whatever the rage is currently?
From what I understand, they want to persist the trained model in a language-independent way: you can train the model with whatever language or framework you wish, then save it to a format that any other language or framework can use to classify unseen instances.
As I'm working on a very similar problem right now, the difficulty is that to save a fitted sklearn model you have to pickle it (a pickled random forest of decent size runs to several megabytes). Then, at classification time, you have to import pickle, sklearn (and numpy), unpickle the object, run the example through the classifier and extract the output. Perhaps the Openscoring model is more efficient?
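To make the round trip concrete, here is a minimal sketch of that pickle workflow; the dataset, filename and estimator settings are just illustrative placeholders:

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Training side: fit a model, then serialize the whole fitted estimator.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# Classification side: sklearn and numpy must still be importable here,
# because unpickling reconstructs their classes in memory.
with open("model.pkl", "rb") as f:
    clf = pickle.load(f)
prediction = clf.predict(X[:1])
```

The serving process pays the full import and unpickling cost, which is exactly the overhead being complained about here.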
You can use `all_model_filenames = joblib.dump(model, filename)` after fitting, on your dev environment. joblib will store each numpy array in the model's data structure as an independent file, and `all_model_filenames[0] == filename` refers to the file holding the main pickle structure.
Then, on your prediction servers, make sure all the files listed in `all_model_filenames` are copied into the same folder. You can then load the model with `model = joblib.load(all_model_filenames[0], mmap_mode='r')`. This makes it possible to use shared memory (memory mapping) for the parameters of a large random forest: all the Gunicorn, Celery or Storm worker processes running on the same server will share the same memory pages, making this a very efficient way to deploy large models on RAM-constrained servers.
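Putting both halves together, a minimal sketch (the filename and toy dataset are illustrative, and the dump must be uncompressed for memory mapping to apply):

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Dev environment: fit and dump. joblib.dump returns the list of files
# it wrote; the first entry is the main pickle structure.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
all_model_filenames = joblib.dump(model, "model.joblib")

# Prediction server: load with mmap_mode='r' so the large numpy arrays
# are memory-mapped read-only and shared across worker processes.
model = joblib.load(all_model_filenames[0], mmap_mode="r")
prediction = model.predict(X[:1])
```

Each Gunicorn/Celery worker that runs the `joblib.load` line maps the same array files, so the forest's parameters occupy physical RAM only once per server.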
You can even use docker to ship the model as part of a container and treat the model as binary software configuration.
As I said, run a separate service for this. That way you only have to load the model (or even train it) once per service process. That is one thing the Openscoring service also does...
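The load-once pattern is just module-level initialization: deserialize the model a single time at process start-up and have every request handler reuse it. A rough sketch, with an in-memory stand-in for the `joblib.load`/`pickle.load` start-up step and a hypothetical `handle_request` function standing in for a Tornado or Flask handler:

```python
import json

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Start-up cost, paid once per service process. In a real service this
# would be joblib.load(...) or pickle.load(...) on the persisted model.
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
MODEL = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

def handle_request(body: str) -> str:
    """One JSON-in/JSON-out prediction, reusing the already-loaded model."""
    features = json.loads(body)["features"]
    label = int(MODEL.predict([features])[0])
    return json.dumps({"label": label})
```

Whatever framework wraps `handle_request`, the key point is that `MODEL` lives at module scope, so per-request work is only the `predict` call.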
If you are more familiar with Python than Java, like me, then that would be a more attractive option.