Large scale non-linear learning on a single CPU

03:30 PM - 03:55 PM on August 16, 2015, Room 705

Andreas Mueller

Audience level:


This talk presents several methods for learning non-linear models on a single machine when the dataset does not fit into RAM. It will cover the hashing trick, kernel approximations, neural networks, and extreme learning machines (random neural networks). This will be a fairly technical talk, showing off some of the lesser-known out-of-core capabilities of scikit-learn.


In the days of the "big data" buzz, many people build data-driven applications on clusters from the start. However, distributed computing is not only pricey, it also requires a large engineering effort and removes interactivity from the data exploration process. In this talk I will demonstrate how to learn powerful non-linear models on a single machine, even with large data sets. This can be achieved using the partial_fit interface provided by scikit-learn, which implements stochastic updates. Combined with stateless transformations of the data, such as hashing, kernel approximation and random projections, this interface allows a model to be built incrementally without the need to store all the data in memory, or even on disk.
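
As a rough illustration of the idea, the sketch below combines a kernel approximation with partial_fit. It is not taken from the talk itself: the batch_stream generator, the synthetic circle-shaped problem, and all parameter values (gamma, n_components, batch sizes) are assumptions made for the example. RBFSampler and SGDClassifier are existing scikit-learn estimators; in a real out-of-core setting the generator would read chunks from disk or a network stream instead of sampling random numbers.

```python
import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDClassifier


def batch_stream(n_batches, batch_size, seed=0):
    """Hypothetical data source yielding one mini-batch at a time.

    The labels mark points inside a circle, so the classes are not
    linearly separable in the original two features.
    """
    rng = np.random.RandomState(seed)
    for _ in range(n_batches):
        X = rng.uniform(-1, 1, size=(batch_size, 2))
        y = (np.sum(X ** 2, axis=1) < 0.5).astype(int)
        yield X, y


# With a fixed float gamma, the random feature map depends only on the
# number of input features and the random seed, not on the data values,
# so it can be "fitted" on a dummy row and then applied batch by batch.
rbf = RBFSampler(gamma=2.0, n_components=300, random_state=0)
rbf.fit(np.zeros((1, 2)))

# Linear model trained with stochastic updates on the random features.
clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # partial_fit needs all classes up front

for X, y in batch_stream(n_batches=200, batch_size=100):
    # Only the current batch is held in memory at any time.
    clf.partial_fit(rbf.transform(X), y, classes=classes)

# Evaluate on a held-out batch drawn with a different seed.
X_test, y_test = next(batch_stream(n_batches=1, batch_size=2000, seed=1))
accuracy = clf.score(rbf.transform(X_test), y_test)
print("held-out accuracy:", accuracy)
```

The same loop structure should carry over to text data by swapping RBFSampler for HashingVectorizer, which is also stateless and therefore needs no pass over the full dataset before training starts.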