David Robinson gives a clear and direct explanation of Bayesian A/B testing. His article is a must-read for anyone who wants a good, concise introduction to the topic.
Continue reading “About bayesian A/B testing”
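For readers who want a feel for the idea before following the link, here is a minimal sketch in Python (standard library only). It is not taken from Robinson's article: it assumes a uniform Beta(1, 1) prior on each conversion rate, and the counts and the function name `prob_b_beats_a` are made up for illustration.

```python
import random


def prob_b_beats_a(success_a, trials_a, success_b, trials_b,
                   draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) given observed counts.

    Assumes a uniform Beta(1, 1) prior, so each posterior is
    Beta(1 + successes, 1 + failures).
    """
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + success_a, 1 + trials_a - success_a)
        rate_b = rng.betavariate(1 + success_b, 1 + trials_b - success_b)
        if rate_b > rate_a:
            wins += 1
    return wins / draws


# Hypothetical counts: 120/1000 conversions for A, 150/1000 for B.
p = prob_b_beats_a(120, 1000, 150, 1000)
print(f"Estimated P(B beats A): {p:.3f}")
```

The result is a direct probability statement about the two variants, which is the main appeal of the Bayesian approach over reporting a p-value.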
If you work with R and do any kind of visualization, it is impossible not to love ggplot2. It makes everything prettier than many of the competing alternatives, including everything that I did for years using MATLAB in industry and academic settings. I’m really happy to read that version 2.0 has been announced.
Continue reading “New version of ggplot2 (2.0.0) with support for extensions.”
In the first part of this series I started the analysis of an interesting dataset drawn from a sample of users of the crowd-sourced review service Yelp. This post presents the second part of the analysis.
Continue reading “Can the groups of unrealistically happy and chronically dissatisfied users be identified? A Yelp dataset study. Part II”
The well-known crowd-sourced review company Yelp makes available a large collection of data known as the Yelp Dataset Challenge. This series of posts explores the dataset, raises a data-driven question, and works through the subsequent analysis.
Continue reading “Can the groups of unrealistically happy and chronically dissatisfied users be identified? A Yelp dataset study. Part I”
On his webpage, Guy Abel presents the R package migest. This package allows for the creation of beautiful circular migration plots. As its creator explains:
The basic idea of the plot is to show simultaneously the relative size of estimated flows between regions. The origins and destinations of migrants are represented by the circle’s segments, where nearby regions are positioned close to each other. The size of the estimated flow is indicated by the width of the link at its bases and can be read using the tick marks (in millions) on the outside of the circle’s segments. The direction of the flow is encoded both by the origin colour and by the gap between link and circle segment at the destination.
An example of such a plot is the following one, which depicts migration flows between continents.
Migest relies on the package circlize. The circlize documentation contains similar graphics; however, they are not as beautiful and polished as the ones produced by migest.
I can definitely see many interesting applications for this kind of plot, as is evident from the many examples scattered across the internet. Beyond depicting migration, this type of chart might be a valuable and beautiful tool for visualizing any kind of flow between entities.
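To make the encoding described above more concrete, here is a small Python sketch of the bookkeeping behind such a plot: a table of origin-destination flows and, for each region, the total leaving and arriving. It does not use migest or circlize, the numbers are hypothetical and not real migration estimates, and the function name `segment_totals` is my own.

```python
# Hypothetical flows in millions, keyed by (origin, destination).
# These are illustrative values only, NOT real migration estimates.
flows = {
    ("Africa", "Europe"): 2.0,
    ("Asia", "Europe"): 1.5,
    ("Asia", "North America"): 3.0,
    ("Europe", "North America"): 0.5,
}


def segment_totals(flows):
    """Sum outgoing and incoming flow for every region."""
    totals = {}
    for (origin, destination), size in flows.items():
        totals.setdefault(origin, {"out": 0.0, "in": 0.0})["out"] += size
        totals.setdefault(destination, {"out": 0.0, "in": 0.0})["in"] += size
    return totals


totals = segment_totals(flows)
for region, t in sorted(totals.items()):
    # Each link's width would be proportional to its flow size; a region's
    # segment has to be wide enough for all links that start or end there.
    print(region, "out:", t["out"], "in:", t["in"], "total:", t["out"] + t["in"])
```

Any data that can be written as such a table of flows between entities (trade, web traffic, job changes) could in principle be drawn the same way.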
As we data scientists know, a good dataset can spark the interest needed to dig deep, draw insightful and useful conclusions, and reveal unforeseen relations in the data. In other words, good data can motivate people and capture their interest; therein lies its power and importance.
There are some great sources of interesting datasets. As with many things involving the internet, the problem is, as always, the signal-to-noise ratio. There is a myriad of sources, but how do you tell the good from the bad? Is there a good, reliable, and varied repository?
Over the years I have found some great sources, but one of the most useful and amazing repositories is the appropriately named Awesome Public Datasets.
It is conveniently organized in categories, some of which are:
- Biology. Includes several genomic and micro-array databases.
- Climate. Some datasets are organized by country and others cover global data.
- Data Challenges. Links to famous data challenges such as the Netflix or Yelp datasets.
- Geology and GeoSpace.
- Government. Organized by country and state, each link is a rabbit hole leading to a universe of varied local data.
- Natural Language.
- Social Science.
- Social Networks.
- … and many more.
I will keep adding posts about interesting data repositories, but for now this great source should be enough to keep anyone busy, probably for a few years.