python - How to efficiently serialize a scikit-learn classifier
What's an efficient way to serialize a scikit-learn classifier?
I'm using Python's standard pickle module to serialize a text classifier, but it results in a monstrously large pickle. The serialized object can be 100MB or more, which seems excessive and takes a while to generate and store. I've done similar work with Weka, and the equivalent serialized classifier was a couple of MBs.

Is scikit-learn possibly caching the training data, or other extraneous info, in the pickle? If so, how can I speed up and reduce the size of serialized scikit-learn classifiers?
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

classifier = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1, 4))),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC())),
])
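For reference, a minimal sketch of the serialization step (the tiny X_train/y_train corpus and the model.pkl path are just placeholders, not the real data):

import os
import pickle

# Placeholder training data, stand-ins for the real corpus.
X_train = ["an example document", "another example document"]
y_train = ["sports", "politics"]

classifier.fit(X_train, y_train)

# Dump with the highest pickle protocol, which is faster and more
# compact than the default.
with open('model.pkl', 'wb') as f:
    pickle.dump(classifier, f, protocol=pickle.HIGHEST_PROTOCOL)

print(os.path.getsize('model.pkl'))  # size on disk, in bytes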
For large text datasets, use the hashing trick: replace the TfidfVectorizer with a HashingVectorizer (potentially stacked with a TfidfTransformer in the pipeline). It will be much faster to pickle, since it won't have to store the vocabulary dict anymore (this is discussed further in a related question).
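A minimal sketch of that swap, applied to the pipeline from the question (n_features=2 ** 18 is an illustrative choice, not a recommendation):

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

classifier = Pipeline([
    # HashingVectorizer is stateless: tokens are hashed into a fixed-size
    # feature space, so no vocabulary dict ends up in the pickle.
    ('vectorizer', HashingVectorizer(ngram_range=(1, 4), n_features=2 ** 18)),
    # Optionally stack a TfidfTransformer to keep tf-idf weighting.
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC())),
])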