python - How to efficiently serialize a scikit-learn classifier
What's an efficient way to serialize a scikit-learn classifier?
I'm using Python's standard pickle module to serialize a text classifier, but it results in a monstrously large pickle. The serialized object can be 100MB or more, which seems excessive and takes a while to generate and store. I've done similar work with Weka, and the equivalent serialized classifier was a couple of MBs.

Is scikit-learn possibly caching the training data, or other extraneous info, in the pickle? If so, how can I speed up and reduce the size of serialized scikit-learn classifiers?
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

classifier = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1, 4))),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC())),
])
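For reference, a minimal sketch of the serialization step (the tiny X_train/y_train corpus and the model.pkl path are just placeholders, not the real data):

import os
import pickle

# Placeholder training data, stand-ins for the real corpus.
X_train = ["an example document", "another example document"]
y_train = ["sports", "politics"]

classifier.fit(X_train, y_train)

# Dump with the highest pickle protocol, which is faster and more
# compact than the default.
with open('model.pkl', 'wb') as f:
    pickle.dump(classifier, f, protocol=pickle.HIGHEST_PROTOCOL)

print(os.path.getsize('model.pkl'))  # size on disk, in bytes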
For large text datasets, use the hashing trick: replace the TfidfVectorizer with a HashingVectorizer (potentially stacked with a TfidfTransformer in the pipeline). It will be much faster to pickle, since it won't have to store the vocabulary dict anymore (this is discussed further in a related question).
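A minimal sketch of that swap, applied to the pipeline from the question (n_features=2 ** 18 is an illustrative choice, not a recommendation):

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

classifier = Pipeline([
    # HashingVectorizer is stateless: tokens are hashed into a fixed-size
    # feature space, so no vocabulary dict ends up in the pickle.
    ('vectorizer', HashingVectorizer(ngram_range=(1, 4), n_features=2 ** 18)),
    # Optionally stack a TfidfTransformer to keep tf-idf weighting.
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC())),
])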