Sparse Data with CatBoost

Jun 7, 2021

How to prevent Jupyter Notebook kernels from dying when using sparse data with CatBoost:

At work, I’ve been dealing with text data and quality control. I wanted to train on a TF-IDF vectorized version of the text, but every time I passed the resulting sparse matrix to the fit() method of CatBoostClassifier, my Jupyter Notebook kernel died.

The simple fix, which I couldn’t find anywhere except here (and this was easy to miss), is this: instead of passing sparse data (scipy.sparse.spmatrix or a subclass thereof) directly to CatBoost’s Pool data handler or to the fit() method, convert it to a sparse pandas DataFrame first, then pass that DataFrame to fit(). (I also had to make sure the validation dataset was a sparse DataFrame.)

Simple as that.
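
Here is a minimal sketch of that conversion, not the exact code from my notebook. It assumes pandas 0.25 or later (for DataFrame.sparse.from_spmatrix) along with scipy, and catboost for the training step. The helper names to_sparse_frame and fit_on_sparse_frames and the "tfidf_<i>" column names are made up for illustration.

```python
import pandas as pd
from scipy import sparse


def to_sparse_frame(matrix, feature_names=None):
    """Wrap a scipy sparse matrix in a pandas DataFrame with sparse columns.

    The data stays sparse; nothing is densified here.
    """
    return pd.DataFrame.sparse.from_spmatrix(matrix, columns=feature_names)


def fit_on_sparse_frames(X_train, y_train, X_val, y_val, **params):
    """Convert both matrices to sparse DataFrames, then fit CatBoost.

    Hypothetical helper: X_train and X_val are scipy sparse matrices with
    the same number of columns, e.g. the output of a TF-IDF vectorizer.
    """
    # Imported lazily so the conversion helper works without catboost installed.
    from catboost import CatBoostClassifier

    # Illustrative string column names, shared by train and validation.
    names = [f"tfidf_{i}" for i in range(X_train.shape[1])]

    train_df = to_sparse_frame(X_train, names)
    val_df = to_sparse_frame(X_val, names)  # validation must be sparse too

    model = CatBoostClassifier(**params)
    model.fit(train_df, y_train, eval_set=(val_df, y_val))
    return model
```

With scikit-learn, TfidfVectorizer.fit_transform() and transform() return scipy sparse matrices, so their outputs can go straight into fit_on_sparse_frames(X_train, y_train, X_val, y_val, verbose=0), where the variable names stand in for your own data.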

Written by Thomas Fackrell