A collection of the best open data sets and open-source tools for data science, wrapped in an easy-to-use REST/JSON API with command-line, Python and JavaScript interfaces. Available as a self-contained Vagrant VM or EC2 AMI that you can deploy yourself.

It's essentially a specialized Linux distribution with a lot of useful data software pre-installed, exposed through a simple interface. For full documentation, see http://www.datasciencetoolkit.org/developerdocs.
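
As a rough illustration of the REST/JSON interface, the sketch below builds a request URL and decodes the JSON reply using only the Python standard library. The path layout `<base>/<endpoint>/<argument>`, the `ip2coordinates` endpoint name used in the note afterwards, and the helper names `dstk_url` and `call_dstk` are assumptions made for illustration, not the toolkit's official client; check the developer docs for the actual endpoints and parameters.

```python
import json
import urllib.parse
import urllib.request


def dstk_url(base, endpoint, argument):
    """Build a GET URL of the assumed form <base>/<endpoint>/<argument>."""
    # Percent-encode the argument so spaces and commas survive in the path.
    encoded = urllib.parse.quote(argument, safe="")
    return f"{base.rstrip('/')}/{endpoint}/{encoded}"


def call_dstk(base, endpoint, argument, timeout=10):
    """Send the request and decode the JSON body (needs a reachable server)."""
    url = dstk_url(base, endpoint, argument)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, `call_dstk("http://localhost:8080", "ip2coordinates", "8.8.8.8")` would query a self-deployed instance, assuming it listens on that hypothetical address.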

Like data? Check out my Data Source Handbook from O'Reilly.

Version 0.50 - May 19th 2013


The Data Science Toolkit was assembled by Pete Warden, and the source code is available at http://github.com/petewarden/dstk.

Country boundaries by Thematic Mapping.

Contains Ordnance Survey data © Crown copyright and database right 2010.

Irish boundaries by Ben Raue.

New Zealand boundaries from Statistics NZ.

Worldwide states and provinces from Natural Earth.

US neighborhood boundaries provided by Zillow under a CC BY-SA license.

This product includes GeoLite data created by MaxMind, available from http://www.maxmind.com/.

The OpenStreetMap and PostGIS projects have also provided some fantastic tools.

Uses geocoding code from GeoIQ and Schuyler Erle.

Uses the OCRopus project for OCR on images, and catdoc for parsing pre-XML Word and Excel documents.

Uses the Hpricot library for parsing HTML.

The Boilerpipe library is used to recognize and extract the main story text from documents.

Uses my Ruby port of Eamon Daly and Jon Orwant's original GenderFromName Perl module to classify first names.

Uses street and place data from OpenStreetMap.

Uses region and postal code data from GeoNames.

Incorporates the city-level TwoFishes geocoder written by David Blackman at Foursquare.

If you have any questions, comments, or suggestions, email Pete.