python - How to efficiently serialize a scikit-learn classifier


What's an efficient way to serialize a scikit-learn classifier?

I'm using Python's standard pickle module to serialize a text classifier, but the result is a monstrously large pickle. The serialized object can be 100MB or more, which seems excessive, and it takes a while to generate and store. I've done similar work with Weka, and the equivalent serialized classifier was a couple of MBs.

Is scikit-learn possibly caching the training data, or other extraneous info, in the pickle? If so, how can I speed up and reduce the size of serialized scikit-learn classifiers?

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

classifier = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1, 4))),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC())),
])
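For reference, a minimal sketch of the serialization step described above; train_texts and train_labels are hypothetical stand-ins for the real corpus. It pickles the fitted pipeline to disk and prints the resulting file size:

import os
import pickle

# Hypothetical stand-in data; substitute the real corpus and labels.
train_texts = ["spam spam spam", "ham and eggs", "more spam please"]
train_labels = [1, 0, 1]

classifier.fit(train_texts, train_labels)

# Serialize the fitted pipeline with the standard pickle module.
with open("classifier.pkl", "wb") as f:
    pickle.dump(classifier, f, protocol=pickle.HIGHEST_PROTOCOL)

print("pickle size: %.2f MB" % (os.path.getsize("classifier.pkl") / 1e6))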

For large text datasets, use the hashing trick: replace the TfidfVectorizer with a HashingVectorizer (potentially stacked with a TfidfTransformer in the pipeline). It will be much faster to pickle, since it won't have to store the vocabulary dict, as discussed in more detail in this question:

How can I reduce the memory usage of scikit-learn vectorizers?
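A minimal sketch of what that swap could look like, assuming the rest of the pipeline from the question stays unchanged: HashingVectorizer replaces CountVectorizer, stacked with a TfidfTransformer. HashingVectorizer is stateless, so the pickled model shrinks to roughly the size of the fitted classifier's coefficients. The n_features=2 ** 20 value is just the library default, spelled out here to make the tradeoff (more features means fewer hash collisions, but larger coefficient arrays) explicit:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

classifier = Pipeline([
    # Stateless: hashes n-grams straight to column indices, no vocabulary dict.
    ('vectorizer', HashingVectorizer(ngram_range=(1, 4), n_features=2 ** 20)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC())),
])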

