Mutable Ideas

Notes and ideas about Java, Scala, Big Data, NoSQL, Quality and Software Deploy

Where to Find Datasets to Learn Big Data & Data Science

Sometimes you just need data to learn how an algorithm works, to run a stress test, or just to have an excuse to spin up several machines in a cluster and see how it crunches the data. More often than not, it is incredibly hard to obtain data, and a few colleagues I’ve talked to had a similar problem, so this post is a collection of links and references to datasets that I know have been released publicly. Please contribute =)

List of Datasets


Introducing Kaggle Datasets: At Kaggle, we want to help the world learn from data. This sounds bold and grandiose, but the biggest barriers to this are incredibly simple. It’s tough to access data. It’s tough to understand what’s in the data once you access it. We want to change this. That’s why we’ve created a home for high quality public datasets, Kaggle Datasets.

Yahoo Labs

Yahoo Releases the Largest-ever Machine Learning Dataset for Researchers: Today, we are proud to announce the public release of the largest-ever machine learning dataset to the research community. The dataset stands at a massive ~110B events (13.5TB uncompressed) of anonymized user-news item interaction data, collected by recording the user-news item interactions of about 20M users from February 2015 to May 2015.

UCI Machine Learning Repository

We currently maintain 342 data sets as a service to the machine learning community.

Common Crawl Data

We build and maintain an open repository of web crawl data that can be accessed and analyzed by anyone.

Awesome Public Datasets

This list of public data sources are collected and tidied from blogs, answers, and user responses.

Buenos Aires Data

Public data and transparency initiative of the Autonomous City of Buenos Aires. Public datasets from the City of Buenos Aires, Argentina.

Datasets Generator

Create Your Own Dataset Consuming Twitter API

This is my own post: Several tutorials assume you already have a data set. Often that is not the case […]. To comply with social networks’ Terms and Conditions you can’t publish your data sets, but you can create your own!

Event Data Generator

Introducing Eventsim: The Demo Event Data Generator: we’re releasing eventsim to the world. This is a tool that I wrote internally to produce a stream of real looking (but fake) event data. We use this for development, testing, and demos.
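
If you only need a handful of fake events to try something out, the idea can be sketched in a few lines of Python. This is not how eventsim works internally; it is just a minimal illustration under my own assumptions. The field names (ts, userId, page), the page list, and the function name generate_events are all made up for this example.

```python
import json
import random
from datetime import datetime, timedelta, timezone

# Hypothetical page names, chosen only for illustration.
PAGES = ["home", "search", "play", "logout"]


def generate_events(n_events, n_users=5, seed=42):
    """Yield n_events fake page-view events as dicts, in timestamp order."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    ts = datetime(2015, 1, 1, tzinfo=timezone.utc)
    for _ in range(n_events):
        # Advance the clock by a random gap so timestamps never go backwards.
        ts += timedelta(seconds=rng.randint(1, 300))
        yield {
            "ts": ts.isoformat(),
            "userId": rng.randint(1, n_users),
            "page": rng.choice(PAGES),
        }


def main():
    # Print one JSON object per line, a common format for event streams.
    for event in generate_events(10):
        print(json.dumps(event))


if __name__ == "__main__":
    main()
```

Because the generator is seeded, two runs with the same arguments produce the same events, which is handy when you want repeatable tests. Change the seed or the number of users to get a different stream.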