Wonderful Datasets, the source of good Data Science


As we Data Scientists know, a good dataset can spark the interest needed to dig deep, draw insightful and useful conclusions, and reveal unforeseen relations among data. In other words, good data can motivate people and capture their interest; therein lies its power and importance.

There are some great sources of interesting datasets. As with many things involving the internet, the problem is, as always, the signal-to-noise ratio. There is a myriad of sources, but how do you tell the good from the bad? Is there a good, reliable, and varied repository?

Over the years I have found some great sources, but one of the most useful and amazing repositories is the appropriately named Awesome Public Datasets.

It is conveniently organized into categories, some of which are:

  • Biology. Includes several genomic and microarray databases.
  • Climate. Some datasets are organized by country; others cover global data.
  • Data Challenges. Links to famous data challenges such as the Netflix or Yelp datasets.
  • Geology and GeoSpace.
  • Government. Organized by country and state; each link is a rabbit hole leading to a universe of varied local data.
  • Healthcare.
  • Natural Language.
  • Social Science.
  • Social Networks.
  • … and many more.

I will keep adding posts about interesting data repositories, but for now this great source should be enough to work with, probably for a few years.