RFECV is much slower than Sklearn's implementation #1047
Comments
Hi @jc639, I would like to explain a bit about the performance differences between YB and Sklearn. The YB model does in fact utilize Sklearn in order to employ the RFE functionality; however, our model wraps it. You can find the code for this visualizer here. My response is mostly taken directly from a comment within our code. Additionally, although not pertinent to your question, the RFE model can be accessed via the `rfe_estimator_` attribute. If you would like to discuss this more, please drop us a comment. Also, maybe @bbengfort can provide a more nuanced answer if I didn't cover everything.
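For illustration, the wrap-RFE-and-cross-validate-separately pattern being described looks roughly like the following. This is a simplified sketch, not Yellowbrick's actual code; the logistic regression model and synthetic dataset are placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# For each candidate subset size, fit a fresh RFE inside cross_val_score.
# Keeping every fold's score makes it possible to plot mean AND std,
# but RFE is re-run from the full feature set at every step, which is
# the source of the slowdown compared with a single RFECV call.
mean_scores, std_scores = [], []
for n in range(1, X.shape[1] + 1):
    rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=n)
    scores = cross_val_score(rfe, X, y, cv=3)
    mean_scores.append(scores.mean())
    std_scores.append(scores.std())

print(len(mean_scores))  # one (mean, std) pair per subset size
```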
@lwgray thanks for the reply, I thought this might be the case. As far as I can tell from the code the reason to separate the two is to compute the following:
This allows plotting of the std (a very good thing!), but might it make sense to have an option for a fast implementation which uses sklearn's RFECV at the expense of the std, but can still utilise the very nice visualisation with the mean score? Also, maybe I am missing something, but couldn't the sklearn RFECV fit method be overridden to skip this one line: I'm aware this may be a bit hacky, but something like this addition:
In the following:
5.3 s ± 582 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
(5, 30)
5.55 s ± 843 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1min 10s ± 2.63 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Again, maybe I am missing something here... Happy to have a go at contributing if you think this might be a good idea - I haven't done it before but am willing to learn!
@bbengfort can you chime in here? I don't know of any instance where we altered a sklearn estimator.
@jc639 first off, I have to say - this is an excellent issue - thank you so much for taking the time to write benchmarks, plot them, and dive into both the yb and sklearn code! It's a lot easier for us to dive into these things when we get such excellent write-ups and investigations from contributors! I have also been noticing a slowdown in RFECV, and I do think it has to do with a divergence between the sklearn implementation and our internal implementation. You're right that we need the reimplementation in order to get data for the visualization that's not provided by the sklearn implementation. Perhaps we should open a scikit-learn issue with your proposed change. In the meantime, I'm going to create a PR to investigate your proposal and loop you in so we can modify it together - does that sound ok?
@bbengfort Thanks for the help. I agree that opening an issue with sklearn might be the way forward.
@bbengfort @lwgray sorry for the radio silence, I've been a bit busy this week. I have some time this weekend and next week to look into this.
Agreed this would be the best approach, to save importing the presumably "non-public" methods from the internal _rfe.py.
@jc639 any word on this? It would be great if we could update the performance of RFECV before the next release.
@bbengfort I have forked your branch, made a few edits, and re-ran the test notebook. Speed is now comparable to Sklearn's. I've just PR'd into your fork's branch.
yellowbrick version: 1.1
328 ms ± 13.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
326 ms ± 23.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I'm running the yellowbrick implementation of RFECV on 738 features and 2843 samples. While the sklearn implementation runs in ~20 min, the yellowbrick one has been running for days (I'm still waiting for the outcome).
I am aware that yellowbrick uses RFE and CV separately to produce the visualiser, but the approach is several times slower than sklearn's implementation of RFECV.
Running the following in a jupyter notebook:
yellowbrick version: 1.1
sklearn version: 0.22.1
1min 23s ± 8.18 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
3.73 s ± 430 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
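The gap above can be reproduced without yellowbrick at all, by timing sklearn's RFECV against the separate RFE-plus-CV loop it wraps. This is an illustrative benchmark on a small synthetic dataset, not the exact notebook code from the issue:

```python
import time
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=20, random_state=1)
model = LogisticRegression(max_iter=1000)

# sklearn: one RFECV call, a single elimination pass per fold.
t0 = time.perf_counter()
RFECV(model, cv=3).fit(X, y)
t_sklearn = time.perf_counter() - t0

# Separate RFE + CV per subset size: each iteration re-runs the
# elimination from the full feature set, so total model fits grow
# roughly quadratically in the number of features.
t0 = time.perf_counter()
for n in range(1, X.shape[1] + 1):
    cross_val_score(RFE(model, n_features_to_select=n), X, y, cv=3)
t_separate = time.perf_counter() - t0

print(f"RFECV: {t_sklearn:.2f}s  separate RFE+CV: {t_separate:.2f}s")
```

The exact ratio depends on the dataset and estimator, but the separate loop does many more model fits, which matches the benchmarks reported in this thread.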
If this is unavoidable due to using CV separately to get the full scores, then it would be nice to note this in the documentation, so that you could use sklearn's RFECV to drop the bottom ~50% of features before running the visualiser.
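That workaround can be sketched with plain sklearn. This is a hypothetical pre-filtering step: RFE with a fixed target is used here instead of RFECV to make the ~50% cut explicit, and the commented yellowbrick call is illustrative rather than taken from the issue:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=7)
model = LogisticRegression(max_iter=1000)

# Coarse pass: cheaply discard roughly the bottom half of the
# features with sklearn's fast recursive elimination (step=2 drops
# two features per iteration).
coarse = RFE(model, n_features_to_select=X.shape[1] // 2, step=2).fit(X, y)
X_reduced = coarse.transform(X)

# X_reduced can then be handed to yellowbrick's slower but richer
# visualizer (hypothetical usage, not run here):
#   from yellowbrick.model_selection import RFECV as YBRFECV
#   viz = YBRFECV(model); viz.fit(X_reduced, y); viz.show()
print(X_reduced.shape)
```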
This got me interested, so I did some digging into what might affect the difference between sklearn's and yellowbrick's RFECV:
Ratio of time difference is fairly stable over number of observations.
As the number of starting features increases, YB becomes even slower relative to sklearn.
YB becomes slower with increasing number of folds too!