Building a data processing pipeline in Python

01:15 PM - 01:40 PM on August 15, 2015, Room 704

Joe Cabrera

Audience level:
intermediate
Watch:
http://youtu.be/AlkuzBbiKk0

Description

Recently, the growth of publicly available data has been enormous. Python has a number of libraries and tools to aid you in building your data processing pipeline, including Celery, Requests, BeautifulSoup, and SQLAlchemy. Combined, they let you build an efficient and scalable data processing pipeline.

Abstract

Recently, the growth of publicly available data has been enormous. In the best cases this data is available in nicely formatted datasets. However, this data is often made available piecemeal, through a collection of poorly formatted documents scattered across the web. In these cases it is necessary to build a data processing pipeline.

In many cases, this data is available online on a publicly accessible site with links to each document in the data collection. Sometimes this data is static and other times new documents are added frequently. When new documents appear frequently, you need a recurring job that grabs them as they become available, in addition to your initial data ingestion. In my pipeline, Celery is used to orchestrate every element of the pipeline, including document download, ingestion, and cleansing. There are two separate tasks: one for ingesting the initial data and one for ingesting newly added documents. The Requests library is used to download all the documents. HTTP requests are built to respect rate-limiting guidelines, and Python 3's concurrent.futures module is used to parallelize requests.
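As a rough illustration, a download task along these lines might look like the sketch below. The broker URL, request delay, and thread count are placeholders rather than values from the talk, and the rate limiting is deliberately crude (a fixed pause before each request).

    import time
    from concurrent.futures import ThreadPoolExecutor

    import requests
    from celery import Celery

    # Placeholder broker URL; point this at your own Redis or RabbitMQ instance.
    app = Celery('pipeline', broker='redis://localhost:6379/0')

    REQUEST_DELAY = 1.0  # seconds to pause before each request (crude rate limiting)
    MAX_WORKERS = 4      # number of parallel download threads


    def fetch(url):
        """Download one document, pausing briefly to respect rate limits."""
        time.sleep(REQUEST_DELAY)
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        return url, response.text


    @app.task
    def download_documents(urls):
        """Celery task: download a batch of document URLs in parallel."""
        with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
            return dict(executor.map(fetch, urls))

The same task could be wired into a periodic schedule (for example with Celery beat) to handle the recurring ingestion of newly published documents.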

Once you have successfully implemented your document download task, you will need to perform some parsing to retrieve the necessary facts from the documents. Often you can leverage existing tools such as BeautifulSoup to aid in document parsing, but sometimes you may need to build your own. At this point you may want to either store the raw data in its final destination, say a database or file store, or perform further data cleansing. In my pipeline, I store the raw data in its final destination, a database, and then kick off a scheduled task to cleanse the data. The cleansing task cleans up and aggregates the raw data, which is re-inserted into the database in a different table.
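A parsing-and-storage step in that spirit might look roughly like the following sketch. The RawDocument table, the HTML selectors, and the SQLite connection string are illustrative stand-ins, not details taken from the talk.

    from bs4 import BeautifulSoup
    from sqlalchemy import Column, Integer, String, Text, create_engine
    from sqlalchemy.ext.declarative import declarative_base
    from sqlalchemy.orm import sessionmaker

    Base = declarative_base()


    class RawDocument(Base):
        """Hypothetical table holding the raw, uncleansed records."""
        __tablename__ = 'raw_documents'
        id = Column(Integer, primary_key=True)
        url = Column(String(500))
        title = Column(String(200))
        body = Column(Text)


    engine = create_engine('sqlite:///pipeline.db')  # placeholder connection string
    Base.metadata.create_all(engine)
    Session = sessionmaker(bind=engine)


    def parse_and_store(url, html):
        """Pull the fields of interest out of the raw HTML and persist them."""
        soup = BeautifulSoup(html, 'html.parser')
        title = soup.find('h1')                      # illustrative selector
        body = soup.find('div', class_='content')    # illustrative selector
        session = Session()
        session.add(RawDocument(
            url=url,
            title=title.get_text(strip=True) if title else '',
            body=body.get_text(strip=True) if body else '',
        ))
        session.commit()
        session.close()

A separate scheduled cleansing task would then read from raw_documents, aggregate the rows, and write the results to a different table.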

As your demand for processing power grows, Celery allows your task queue to be distributed across multiple worker nodes. To house your growing data, SQLAlchemy makes it easy to add new tables to your database. At some point you will most likely need to shard your database, and SQLAlchemy provides a basic sharding API.
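For the sharding piece, a minimal sketch using SQLAlchemy's horizontal_shard extension might look like this; the keyword names follow the 1.x-era API (newer versions have renamed some of them), and the two shards and the routing rules are purely illustrative.

    from sqlalchemy import create_engine
    from sqlalchemy.ext.horizontal_shard import ShardedSession
    from sqlalchemy.orm import sessionmaker

    # Two hypothetical shards; in practice each engine would point at a separate server.
    shards = {
        'shard_a': create_engine('sqlite:///shard_a.db'),
        'shard_b': create_engine('sqlite:///shard_b.db'),
    }


    def shard_chooser(mapper, instance, clause=None):
        # Route new rows with a trivial rule, e.g. even/odd primary key.
        return 'shard_a' if (getattr(instance, 'id', 0) or 0) % 2 == 0 else 'shard_b'


    def id_chooser(query, ident):
        # The shard is not encoded in the id, so search every shard.
        return list(shards)


    def query_chooser(query):
        # Without more context, fan queries out to all shards.
        return list(shards)


    Session = sessionmaker(class_=ShardedSession)
    Session.configure(
        shards=shards,
        shard_chooser=shard_chooser,
        id_chooser=id_chooser,
        query_chooser=query_chooser,
    )

On the Celery side, adding capacity is mostly operational: starting "celery -A pipeline worker" on additional machines pointing at the same broker (assuming the app module is named pipeline) distributes the task queue without code changes.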

Python has a number of libraries and tools to aid you in building your data processing pipeline. Combined, they let you build an efficient and scalable data processing pipeline.