Can the groups of unrealistically happy and chronically dissatisfied users be identified? A Yelp dataset study. Part I

The well-known crowd-sourced review company Yelp makes available a large collection of data known as the Yelp Dataset Challenge.
This series of posts explores the dataset, raises a data-driven question, and presents the subsequent analysis.


The Yelp dataset

The Yelp dataset is distributed as five different files and includes:

  • 1.6M reviews and 500K tips by 366K users for 61K businesses.
  • 481K business attributes, e.g., hours, parking availability, ambience.
  • A social network of 366K users with a total of 2.9M social edges.
  • Aggregated check-ins over time for each of the 61K businesses.
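As a minimal sketch of how these files can be read (assuming, as in the distributed dataset, one JSON object per line; the field names below follow the User file's schema, but the sample records themselves are invented):

```python
import json

# Tiny inline sample mimicking the JSON-lines layout of the User file;
# the records are invented, only the field names follow the real schema.
sample_lines = [
    '{"user_id": "u1", "review_count": 12, "average_stars": 3.5}',
    '{"user_id": "u2", "review_count": 4, "average_stars": 5.0}',
]

users = [json.loads(line) for line in sample_lines]

# With the real file, iterate over it line by line instead:
# with open("yelp_academic_dataset_user.json") as f:
#     users = [json.loads(line) for line in f]

print(len(users), users[1]["average_stars"])  # → 2 5.0
```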

Distribution of stars given by users. Two anomalous groups.

From the collection of datasets made available by Yelp (named Review, Tip, Checkin, User and Business), the most interesting one for me was the User dataset, which contains variables that relate directly to a diverse sample of Yelp users.

Used correctly, this dataset could serve as the basis for an interesting analysis of the underlying factors that describe a user’s behavior. It also has the potential to shed light on the particular characteristics shared by certain subgroups of users.

After an initial exploration of the User and Business datasets, explained in the following sections, I discovered a very interesting discrepancy between the distribution of scores received by each business (expressed in the form of “stars”) and the distribution of stars given by users. In the Business dataset every business has an average number of received stars, while in the User dataset each user has an average star number, calculated from the diverse reviews written by that particular user.

There were two very strange peaks showing an abnormally large percentage of users with an average of either 5 or 1 stars. This means that, beyond the expected distribution of stars across the spectrum of users, the people belonging to these two groups either:

a) found every single business terrible, the lowest kind of terrible (1 star), or
b) found every single business worthy of the highest possible praise and awarded it the best rating (5 stars).
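A minimal sketch of how these two groups could be pulled out of the user records (toy data; the `average_stars` field name follows the dataset, the records are invented):

```python
# Toy user records; with the real data these would come from the User file.
users = [
    {"user_id": "u1", "average_stars": 1.0},
    {"user_id": "u2", "average_stars": 3.7},
    {"user_id": "u3", "average_stars": 5.0},
    {"user_id": "u4", "average_stars": 5.0},
]

# Users whose average sits exactly at either extreme of the star scale.
one_star_group = [u["user_id"] for u in users if u["average_stars"] == 1.0]
five_star_group = [u["user_id"] for u in users if u["average_stars"] == 5.0]

print(one_star_group, five_star_group)  # → ['u1'] ['u3', 'u4']
```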

Why could this be of any importance?

The obvious objection is that it is not realistic to walk through life and find every single business either completely terrible or incredibly amazing. It could point, respectively, either to a masked misanthropy or to a euphoric cognitive dissonance. More seriously, if we put this data in its real commercial context, identifying these two groups of users would be incredibly valuable for:

a) Avoiding the skewness added by these two groups in complex data analyses.

b) Analyzing important subgroups present in the data.

c) Creating specific targeted advertising or a directed commercial strategy to address the characteristics and needs of each particular group.

The analysis presented in this report tries to answer the question posed in the title: “Can unrealistically happy and terribly killjoy users be identified?”

Methods and Data


Distribution assessment

First, let’s have a look at the distribution of the average stars per user in the User dataset. The following graph shows the density.


We can also see how it differs from the distribution of stars obtained by each business in the Business dataset.
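As a rough numeric stand-in for these density plots, the same kind of distribution can be tabulated as a histogram over the 1–5 star range (the `avg_stars` vector here is invented for illustration, not the actual Yelp data):

```python
import numpy as np

# Hypothetical per-user average stars; with the real data this would be
# the "average_stars" field extracted from the User file.
avg_stars = np.array([1.0, 1.0, 3.5, 3.7, 4.0, 4.2, 5.0, 5.0, 5.0])

# Bin the averages over the 1-5 range; the extreme bins expose the
# peaks at 1 and 5 described in the text.
counts, edges = np.histogram(avg_stars, bins=8, range=(1.0, 5.0))
print(counts.tolist())  # → [2, 0, 0, 0, 0, 2, 2, 3]
```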


Large peaks at 1 and 5 are evident in the density distribution of users’ average stars. Although it is also evident that this distribution differs from that of the business stars, I performed a non-parametric test (the Wilcoxon rank-sum test) in order to obtain a formal numerical confirmation.
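A sketch of this comparison using SciPy’s implementation of the rank-sum test (the two samples below are simulated stand-ins, not the actual Yelp data):

```python
import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng(0)

# Simulated stand-ins: per-user averages with extra mass piled at the
# 1- and 5-star extremes, versus roughly bell-shaped business averages.
user_avg = np.concatenate([
    np.full(150, 1.0),
    np.full(250, 5.0),
    rng.normal(3.7, 0.6, 600).clip(1.0, 5.0),
])
business_avg = rng.normal(3.7, 0.6, 1000).clip(1.0, 5.0)

# Two-sided test of whether the two samples come from the same distribution.
stat, pvalue = ranksums(user_avg, business_avg)
print(stat, pvalue)
```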

Once the initial scope is decided, we can look directly at how the variables relate to each other.

Exploratory relationships among descriptive features

The following figure shows only a few of the available predictors (due to space constraints); the actual number of predictors is higher. The labels in the plot were purposely omitted.


Many of the variables have positive skewness (long right tails), so they were processed using a log transform.
Additionally, new descriptors were created by combining existing specific counts, for example by pooling them into a total count for a certain category, or by summarizing the diverse types of votes and compliments into single features.
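The two transformations described above can be sketched like this (toy records; the field names are loosely modeled on the User file, the values are invented):

```python
import numpy as np

# Toy records with right-skewed counts and itemized vote types.
users = [
    {"review_count": 3, "votes": {"funny": 0, "useful": 2, "cool": 1}},
    {"review_count": 250, "votes": {"funny": 40, "useful": 90, "cool": 12}},
]

for u in users:
    # log1p handles the zero counts that a plain log would reject.
    u["log_review_count"] = float(np.log1p(u["review_count"]))
    # Pool the itemized vote types into a single total-votes feature.
    u["total_votes"] = sum(u["votes"].values())

print(users[0]["total_votes"], users[1]["total_votes"])  # → 3 142
```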

In the second part of this series I’ll continue with the data exploration and the construction of prediction models.

Continues in Part II.
