python - How to efficiently serialize a scikit-learn classifier


What's an efficient way to serialize a scikit-learn classifier?

I'm using Python's standard pickle module to serialize a text classifier, but it results in a monstrously large pickle. The serialized object can be 100MB or more, which seems excessive and takes a while to generate and store. I've done similar work with Weka, and the equivalent serialized classifier was a couple of MBs.

Is scikit-learn possibly caching the training data, or other extraneous info, in the pickle? If so, how can I speed up and reduce the size of serialized scikit-learn classifiers?

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

classifier = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1, 4))),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC())),
])
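For reference, a minimal sketch of the pickling step being described, assuming the pipeline has already been fitted on some training data; the file name is an arbitrary placeholder:

import pickle

# Serialize the fitted pipeline with the standard pickle module.
with open('classifier.pkl', 'wb') as f:
    pickle.dump(classifier, f)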

For large text datasets, use the hashing trick: replace the TfidfVectorizer with a HashingVectorizer (potentially stacked with a TfidfTransformer in the pipeline). It is much faster to pickle because there is no vocabulary dict to store, as discussed in more detail in this question:

How can I reduce memory usage of scikit-learn vectorizers?
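A rough sketch of that suggestion applied to the pipeline from the question, with the CountVectorizer replaced by a HashingVectorizer (n_features is spelled out at its library default here, just for illustration):

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# HashingVectorizer is stateless: tokens are hashed into a fixed-size
# feature space, so there is no vocabulary dict to build or pickle.
classifier = Pipeline([
    ('vectorizer', HashingVectorizer(ngram_range=(1, 4), n_features=2**20)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC())),
])

The trade-off is that hashing is one-way: you lose the ability to map feature indices back to the original tokens for introspection.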

