High Performance NLP: Issues, Applications, Practice Algorithms

Tips for using Scikit-learn on very large datasets

Author: Kenneth Steimel


When constructing n-gram models, one often ends up with an extremely sparse feature-value matrix. This is a direct result of how language is distributed: a large number of elements occur very infrequently, while a small number of elements occur with great frequency.
One could simplify things by naively lopping off some number of features. The frequencies in a large amount of text are likely to be approximately the same as the frequencies in other texts, so the infrequent features probably don't generalize well anyway. For example, one could keep only the 1000 most frequent features. However, there are more sophisticated methods available, such as chi-squared feature selection, PCA, and SVD. Scikit-learn provides highly optimized, easily parallelizable versions of these algorithms, and it lets you tune them to produce the best results in conjunction with the classifier being used. It therefore makes sense to keep the full sparse matrix and let these methods do the reduction.
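As a rough illustration of the feature-reduction methods mentioned above, the sketch below applies chi-squared selection and truncated SVD to a sparse matrix. The data is randomly generated and purely hypothetical, as are the sizes chosen (200 documents, 5000 features, 1000 selected features, 50 components). `TruncatedSVD` is used here as the SVD variant because it accepts sparse input directly; treat this as one possible setup rather than a recommended configuration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.RandomState(0)

# Hypothetical data: 200 documents x 5000 n-gram features, mostly zero counts.
counts = rng.poisson(0.05, size=(200, 5000))
X = csr_matrix(counts, dtype=np.float64)
y = rng.randint(0, 2, size=200)  # made-up binary class labels

# Chi-squared selection keeps the k features with the highest chi-squared score.
selector = SelectKBest(chi2, k=1000)
X_chi2 = selector.fit_transform(X, y)

# Truncated SVD projects the sparse matrix onto a few dense components.
svd = TruncatedSVD(n_components=50, random_state=0)
X_svd = svd.fit_transform(X)

print(X_chi2.shape, X_svd.shape)
```

Both transformers take the sparse matrix as-is, so the data never has to be converted to a dense array before the reduction step.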
However, several issues emerge as a result.

The solution that works best for me is to use the scipy.sparse module to generate sparse matrices, which scikit-learn can consume directly. The format best suited to filling in values in the feature-value matrix is scipy.sparse.lil_matrix, whereas scikit-learn works best with matrices in CSR format (scipy.sparse.csr_matrix). What works best for me is to fill in the values using lil_matrix and then convert to CSR with .tocsr().
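The fill-then-convert workflow described above might look like the following minimal sketch. The toy corpus and vocabulary are hypothetical and stand in for a real n-gram extraction step.

```python
from scipy.sparse import lil_matrix

# Hypothetical toy corpus and vocabulary (token -> column index).
docs = [["the", "cat", "sat"], ["the", "dog", "sat", "sat"]]
vocab = {"the": 0, "cat": 1, "dog": 2, "sat": 3}

# lil_matrix supports cheap element-wise assignment while the matrix is built.
X = lil_matrix((len(docs), len(vocab)), dtype=float)
for row, tokens in enumerate(docs):
    for token in tokens:
        X[row, vocab[token]] += 1

# Convert once, after filling, to the CSR format scikit-learn prefers.
X_csr = X.tocsr()
print(X_csr.toarray())
```

The conversion is done a single time at the end, since incremental assignment into a CSR matrix is comparatively expensive.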


Ideally, you would not recompute the features every time you train a model, and the cost of doing so only grows with the amount of data. One of the most common and easiest ways to serialize the feature-value matrix you have tirelessly created is to export it as a pickle. This is commonly done as follows:

    import pickle

    obj = generateObject()  # pretend this function creates some complex object
    with open('myfile.pkl', 'wb') as fp:  # 'wb': pickle writes bytes
        pickle.dump(obj, fp)

See the documentation for more information.

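However, with very large objects (more than 4 GB), problems emerge because older pickle protocols cannot handle objects of this size. If your data is sparse, you can shrink the object by using one of the sparse formats mentioned above. If it is still too large, pass the highest available protocol to pickle.dump (protocol=pickle.HIGHEST_PROTOCOL); protocol 4 and later support very large objects. pickle.load takes no protocol argument, because it detects the protocol from the file itself.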
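A minimal sketch of selecting the protocol explicitly is shown below. The protocol is an argument to pickle.dump only; pickle.load detects it from the data. The small dictionary is a hypothetical stand-in for a large feature matrix, and the temporary file path is chosen just for this example.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a large feature matrix.
features = {"rows": [0, 0, 1], "cols": [3, 7, 2], "values": [1.0, 2.0, 1.0]}

path = os.path.join(tempfile.mkdtemp(), "features.pkl")

# The protocol is chosen when writing...
with open(path, "wb") as fp:
    pickle.dump(features, fp, protocol=pickle.HIGHEST_PROTOCOL)

# ...and detected automatically when reading.
with open(path, "rb") as fp:
    restored = pickle.load(fp)

print(restored == features)
```

A file written with a newer protocol cannot be read by Python versions that predate that protocol, so this choice matters if the pickle is shared across environments.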