Scikit-learn pipelines. Do they make any sense?
A noob question.
As I understand it, scikit-learn's pipeline module is a kind of automation helper that carries data through a defined processing sequence. But if that is all it does, I don't see the point of it.
Why can't I implement data preparation, model training, score estimation, etc. with plain functional or OOP Python? To me that seems much more flexible and simple: you can control all the inputs, adjust complex dynamic parameter grids, evaluate complex metrics, and so on.
Can you tell me why anyone should use sklearn.pipeline? Why does it exist?
Solution 1:[1]
Read this article, which walks through an example of using scikit-learn's Pipeline. Optimizing hyperparameters across a whole pipeline is one example of the work you save by using pipelines.
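As a minimal sketch of that idea (the dataset, step names, and parameter values below are illustrative assumptions, not taken from the article): a single grid search can tune a model through its pipeline, and the scaler is re-fit on each cross-validation training fold, so there is no manual bookkeeping needed to avoid data leakage.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Preprocessing and model bundled into one estimator.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", SVC()),
])

# Parameters of any step are addressed as "<step name>__<parameter>",
# so one search tunes the model through the whole pipeline, and the
# scaler is re-fit on every training fold rather than on all the data.
grid = GridSearchCV(pipe, {"clf__C": [0.1, 1, 10]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```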
Solution 2:[2]
I have used pipelines recently for data exploration purposes: I wanted to random-search over different pipelines (see the sketch below).
This could be at least one reason to use pipelines.
But you are right that pipelines aren't very useful for many other purposes.
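A hedged sketch of what random-searching different pipelines can look like (the dataset, preprocessors, and parameter values are my assumptions): because a pipeline step is itself a settable parameter, one search can compare whole alternative preprocessors alongside model hyperparameters.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=5000)),
])

# A step name used as a plain parameter lets the search swap whole
# components; "passthrough" drops the step entirely.
param_distributions = {
    "scale": [StandardScaler(), MinMaxScaler(), "passthrough"],
    "clf__C": [0.01, 0.1, 1.0, 10.0],
}

search = RandomizedSearchCV(pipe, param_distributions, n_iter=8,
                            cv=5, random_state=0)
search.fit(X, y)
print(search.best_params_)
```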
Solution 3:[3]
Because plenty of transformers are already implemented in scikit-learn (StandardScaler, OneHotEncoder, ...). They are very commonly used, and being able to apply them simply by dropping them into a pipeline is very convenient.
However, when it comes to complex metrics or transformations, you will need to implement a custom pipeline stage yourself, as it will likely not be available natively.
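A custom stage is just a small class; here is a minimal sketch (the quantile-clipping transformer is an invented example, not something from this thread): implement fit and transform and inherit scikit-learn's mixins so the class plugs into a Pipeline.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class QuantileClipper(BaseEstimator, TransformerMixin):
    """Clip each column to the quantile range learned during fit()."""

    def __init__(self, low=0.01, high=0.99):
        self.low = low
        self.high = high

    def fit(self, X, y=None):
        X = np.asarray(X)
        # Learn per-column clipping bounds from the training data only.
        self.low_ = np.quantile(X, self.low, axis=0)
        self.high_ = np.quantile(X, self.high, axis=0)
        return self

    def transform(self, X):
        return np.clip(np.asarray(X), self.low_, self.high_)
```

TransformerMixin supplies fit_transform for free, and BaseEstimator provides get_params/set_params, which is what lets such a stage participate in parameter searches like the ones above.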
The idea behind a pipeline is that once built, you have a single object to deal with, without needing to worry about preprocessing your data before using it:
```python
my_pipeline.fit(raw_data)
my_pipeline.transform(other_raw_data)
```
Another advantage is that you can save your pipeline to disk and reload it, which makes it quite convenient for deployment (sketched below).
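A short sketch of that deployment point, assuming joblib (which ships as a scikit-learn dependency) and an arbitrary file name: the fitted preprocessing and model travel as a single artifact.

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])
pipe.fit(X, y)

# One file holds preprocessing and model together.
joblib.dump(pipe, "model.joblib")

restored = joblib.load("model.joblib")
print(restored.predict(X[:5]))  # raw features in, predictions out
```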
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | YScharf |
| Solution 2 | PushTheButton |
| Solution 3 | |
