Large scale non-linear learning on a single CPU

03:30 PM - 03:55 PM on August 16, 2015, Room 705

Andreas Mueller

Audience level:
advanced
Watch:
http://youtu.be/l43VIw5xhTg

Description

This talk presents several methods for learning non-linear models on a single machine when the dataset does not fit into RAM. It will cover the hashing trick, kernel approximations, neural networks, and extreme learning machines (random neural networks). This will be a fairly technical talk, showing off some of the lesser-known out-of-core capabilities of scikit-learn. A small sketch of one of these techniques follows.
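
To give a flavor of the kernel approximation technique mentioned above, here is a minimal sketch using scikit-learn's RBFSampler (random Fourier features) so that a plain linear classifier can learn a non-linear decision boundary. The synthetic dataset and hyperparameter values are illustrative assumptions, not taken from the talk.

    # Minimal sketch: kernel approximation with random Fourier features.
    # The dataset and hyperparameters are placeholders, not from the talk.
    from sklearn.datasets import make_moons
    from sklearn.kernel_approximation import RBFSampler
    from sklearn.linear_model import SGDClassifier

    X, y = make_moons(n_samples=1000, noise=0.2, random_state=0)

    # Map inputs into a randomized feature space approximating an RBF kernel.
    rbf_features = RBFSampler(gamma=2.0, n_components=300, random_state=0)
    X_features = rbf_features.fit_transform(X)

    # A linear model trained on the transformed features can now capture
    # the non-linear structure of the data.
    clf = SGDClassifier(random_state=0)
    clf.fit(X_features, y)
    print("training accuracy:", clf.score(X_features, y))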

Abstract

In the days of the "big data" buzz, many people build data-driven applications on clusters from the start. However, working with distributed computing is not only expensive, it also requires a large engineering effort and removes interactivity from the data exploration process. In this talk I will demonstrate how to learn powerful non-linear models on a single machine, even with large datasets. This can be achieved using the partial_fit interface provided by scikit-learn, which implements stochastic updates. Together with stateless transformations of the data, such as hashing, kernel approximation, and random projections, this allows building a model incrementally without the need to store all of the data in memory, or even on disk.
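
As an illustration of the partial_fit pattern described above, here is a minimal sketch combining a stateless HashingVectorizer with SGDClassifier so that minibatches can be streamed through the model. The document stream, labels, and batching logic are hypothetical placeholders; in practice the generator would read raw data from disk or a database.

    # Minimal out-of-core sketch: stateless hashing + partial_fit.
    # The minibatch generator below is a hypothetical placeholder.
    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import SGDClassifier

    vectorizer = HashingVectorizer(n_features=2**18)  # stateless: no fit needed
    clf = SGDClassifier(random_state=0)

    def iter_minibatches():
        # Placeholder: yields (texts, labels) batches; a real version would
        # stream raw documents and labels from disk or a database.
        batches = [
            (["good movie", "great plot"], [1, 1]),
            (["terrible acting", "boring film"], [0, 0]),
        ]
        for texts, labels in batches:
            yield texts, labels

    classes = [0, 1]  # all classes must be declared on the first partial_fit call
    for texts, labels in iter_minibatches():
        X_batch = vectorizer.transform(texts)  # hashing trick, no vocabulary stored
        clf.partial_fit(X_batch, labels, classes=classes)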