BGG data analysis entries

BGG database: Introduction

The entries in this section analyze different aspects of the BoardGameGeek (BGG) database. I was inspired by Dinesh Vatvani's analysis (http://dvatvani.github.io/BGG-Analysis-Part-1.html). I wanted to go further, and it was also a way to practice my machine learning skills in Python. I’m also interested in understanding players: what they like and how they think. Hopefully, this will help me design better and more popular boardgames.

This section might be a little dry, but I think it is important for understanding what is to come. I will cover three things: what the BGG database is, what we must keep in mind while interpreting the data, and how I did the analysis. So here we go!

What is in the BGG database?

There is a lot of information on the BGG website. As of January 7th 2020, there are 112 758 entries in the database, of which 91 699 are boardgames and 21 059 are expansions. The information we can extract from the database is:

  • id: The unique id of the entry. Each entry is tagged by a different integer in the database.

  • type: Boardgame or expansion

  • name: The title of the entry

  • yearpublished: The year the entry was published

  • minplayers: The minimum number of players required to play the entry

  • maxplayers: The maximum number of players that can play the entry

  • playingtime: The duration of the game (Not quite sure how it differs from maxplaytime)

  • minplaytime: The minimum duration of the entry

  • maxplaytime: The maximum duration of the entry

  • minage: The minimum age to understand the entry and be able to play it

  • users_rated: The number of users who gave a rating to the entry

  • average_rating: The average of the user ratings of the entry

  • bayes_average_rating: The Bayesian-corrected average rating, which adds a number (~750 for boardgames) of dummy ratings (~5.5) to dampen outliers when the number of ratings is small. Note that when an entry has fewer than 30 ratings (i.e. too few to be statistically significant), the Bayesian average is not defined.

  • total_owners: The number of users owning the entry

  • total_traders: The number of users who traded the entry

  • total_wanters: The number of users looking to buy/trade for the entry

  • total_wishers: The number of users desiring/interested in the entry

  • total_comments: The number of comments about the entry

  • total_weights: The number of users who evaluated the complexity weight of the entry

  • average_weight: The average complexity weight of the game, as evaluated by users

  • types: The type of the entry: Abstract, Boardgame, CGS, Children game, Family game, Party game, RPG item, Strategy game, Thematic and/or Wargames. CGS stands for Customizable Game System. The types Boardgame, Family game and RPG item seem to be deprecated, as they were not very informative. Types are more or less "clusters" of tastes, player types or market segments. Not to be confused with 'type', which is either boardgame or expansion.

  • categories: This is a list of various themes, but it also includes some of the types

  • mechanics: The list of game mechanics present in the game

  • designers: The designers of the entry

  • artists: The artists who illustrated the entry

  • publishers: The publishers who released the entry

  • family: This is a list of tags that do not fall into specific categories. They indicate if a game was crowdsourced, if it relates to a movie or tv series, if the game is a Legacy, if it has particular components, etc.

  • expansions: The expansions related to the game
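To make the bayes_average_rating field more concrete, here is a minimal Python sketch of how such a corrected average can be computed. BGG does not publish its exact formula, so this is only the standard "add dummy ratings" form that matches the description above. The function name and parameters are my own, the defaults of 750 dummy ratings at 5.5 are the approximate values quoted in the list, and the 30-rating threshold is the one mentioned there.

```python
def bayesian_average(ratings, prior_count=750, prior_rating=5.5, min_ratings=30):
    """Blend the real ratings with `prior_count` dummy ratings of `prior_rating`.

    Returns None when there are fewer than `min_ratings` ratings, mirroring
    the "not defined" case described in the text. All defaults are
    approximations taken from the text, not official BGG values.
    """
    n = len(ratings)
    if n < min_ratings:
        return None
    return (prior_count * prior_rating + sum(ratings)) / (prior_count + n)


# With only 40 ratings of 9.0, the result stays close to the dummy value of 5.5.
few = bayesian_average([9.0] * 40)
# With 20 000 ratings of 9.0, the result approaches the plain average of 9.0.
many = bayesian_average([9.0] * 20000)
# With fewer than 30 ratings, no Bayesian average is returned.
too_few = bayesian_average([9.0] * 10)

print(round(few, 2), round(many, 2), too_few)
```

The more ratings an entry has, the less the dummy ratings matter, which is why a handful of enthusiastic ratings cannot push an obscure entry to the top of the ranking.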

Please note that there may be more items (such as language independence, honors). These are found on the webpages, but do not seem to be accessible through the database. In any case, this list will be updated periodically as I include more information in the database.

What to keep in mind

Interpreting data is more difficult than it seems. Data that seems to have one meaning can often be interpreted differently. We must also be wary not to confuse correlation with causation: two variables that change together are not necessarily causally linked. But the most important thing is to always be aware of the assumptions underlying the data. Here are a few points to bear in mind:

  • The average BGG user is not the average player. It is safe to assume the average BGG user is more passionate about boardgames than the average player. It means they are more literate game-wise. They may like more complex games because of that and rate them accordingly. This introduces a bias in the evaluations (both their number and their value) compared to the real world.

  • Not all games played are rated. A lot of games are played in the real world, most of which are not rated on BGG. It might still be that the ratings on BGG are a good sample of the real world, but it is unlikely.

  • The rating is independent of the number of plays. Let’s say you play a game a single time and rate it on BGG. Another player plays the same game 10 times and rates it on BGG. Which rating has more weight? On BGG, both are equal, although arguably the second one is more informed.

  • Popularity of a game on BGG may not reflect its sales. We can expect a correlation between the two, although sales might be reflected more in the number of ratings than in the ratings themselves. Some well-known best-sellers, such as Monopoly, do not fare well on BGG. Children's games are famously misrepresented on BGG. We can also expect that games based on a licence, such as Transformers, sell better than their ratings suggest.

  • Preferences are idiosyncratic. Each of us has different tastes in food, music, movies… and games. This raises the question: what does it mean when we say a game is good? Is it well-designed, well-produced? Is it to our taste? To most people’s taste? Was it well marketed, so that we desire it and therefore like it? Is it because it won prizes? Then how were those prizes awarded? It is difficult to answer those questions with certainty. But it is good to keep them in mind when trying to understand what is going on.

  • The list of mechanics has been changed recently. There was a recent overhaul of the mechanics on BGG based on the book Building Blocks of Tabletop Game Design by Engelstein and Shalev. This means that some games in the database might not have been updated with the new mechanics. The new list is quite long, with nearly 200 mechanics. This is both useful and problematic. It is useful since it gives information about the game and helps us understand how the mechanics (which are more or less game design patterns for boardgames) are used. It is problematic since the list is so long that it is difficult to draw conclusions with that many variables. Players definitely do not think about games in terms of that many mechanics, so it is more an academic point of view than a pragmatic one. And such a list can never be exhaustive, since new mechanics can always be invented.

  • The types and categories may overlap. Types are like genres. I think they are useful in providing a broad idea of the game’s experience. Categories are more or less themes, but they are ill-defined. It might be more useful to see types (and categories) as clusters of games. Let’s say we put all games as dots on a chart according to the aesthetic experience they offer. Similar games will be close to one another. Well, genres are those clusters of games. The clusters will not be sharply defined, as there are games of all sorts.

  • The database is not complete. While all the variables presented in the section above can be found in an entry, a lot of entries are only partially filled. The analysis is therefore only as good as the data users put in the database.
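One simple way to gauge how incomplete the database is would be to measure, for each field, the fraction of entries where it is actually filled in. The sketch below is only an illustration: the function names and the four sample entries are made up, and treating 0, an empty string or an empty list as "missing" is my assumption about how unfilled values show up after scraping.

```python
def is_missing(value):
    """Treat None, empty strings, empty lists and 0 as 'not filled in' (an assumption)."""
    return value is None or value == "" or value == [] or value == 0


def fill_rates(entries, fields):
    """Return, for each field, the fraction of entries where it is filled in."""
    total = len(entries)
    rates = {}
    for field in fields:
        filled = sum(1 for entry in entries if not is_missing(entry.get(field)))
        rates[field] = filled / total if total else 0.0
    return rates


# Four invented entries, used only to show the shape of the result.
entries = [
    {"name": "Game A", "yearpublished": 2015, "mechanics": ["Dice Rolling"], "minage": 8},
    {"name": "Game B", "yearpublished": 0, "mechanics": [], "minage": 10},
    {"name": "Game C", "yearpublished": 1999, "mechanics": ["Hand Management"], "minage": None},
    {"name": "Game D", "yearpublished": 2003, "mechanics": [], "minage": 0},
]

rates = fill_rates(entries, ["name", "yearpublished", "mechanics", "minage"])
for field, rate in rates.items():
    print(f"{field}: {rate:.0%} filled")
```

Fields with a low fill rate are the ones where conclusions should be drawn with the most caution.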

How I did it

I used Python for just about everything in this project. I decided to work in Jupyter Notebook (JN) to keep everything in one place. I modified Dinesh Vatvani's code to scrape the BGG database and created a JN for that. The scraping was done using the scrapy library, in two steps. First, the browse page of BoardGameGeek is accessed to extract the list of games. Next, BGG’s XMLAPI2 is used to access the database and extract the information. It seems that the information accessible through the XMLAPI2 is different from what can be accessed directly on the webpages. At some point, I might have to scrape the webpages directly, but in the meantime, there is plenty to do with the information in the database.
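To give an idea of what the second step involves, here is a rough sketch of turning an XMLAPI2-style response into a Python dictionary. It is not the scrapy code used for this project: it relies only on the standard library and parses a hand-written XML snippet instead of calling the API. The element and attribute names are my assumption about what a "thing" response with statistics looks like, and the game in the snippet is invented, so the layout should be checked against a real response before relying on it.

```python
import xml.etree.ElementTree as ET

# Hand-written sample imitating an XMLAPI2 response (assumed layout, invented game).
SAMPLE_XML = """
<items>
  <item type="boardgame" id="1">
    <name type="primary" value="Example Game"/>
    <name type="alternate" value="Jeu Exemple"/>
    <yearpublished value="2010"/>
    <minplayers value="2"/>
    <maxplayers value="4"/>
    <link type="boardgamecategory" id="100" value="Economic"/>
    <link type="boardgamemechanic" id="200" value="Dice Rolling"/>
    <link type="boardgamemechanic" id="201" value="Hand Management"/>
    <statistics>
      <ratings>
        <usersrated value="1234"/>
        <average value="7.1"/>
        <bayesaverage value="6.8"/>
        <averageweight value="2.3"/>
      </ratings>
    </statistics>
  </item>
</items>
"""


def parse_item(item):
    """Convert one <item> element into a flat dictionary of fields."""

    def value(path, cast=str):
        # Most fields store their content in a "value" attribute.
        node = item.find(path)
        return cast(node.get("value")) if node is not None else None

    def links(link_type):
        # List-like fields are repeated <link> elements distinguished by "type".
        return [link.get("value") for link in item.findall(f"link[@type='{link_type}']")]

    return {
        "id": int(item.get("id")),
        "type": item.get("type"),
        "name": value("name[@type='primary']"),
        "yearpublished": value("yearpublished", int),
        "minplayers": value("minplayers", int),
        "maxplayers": value("maxplayers", int),
        "users_rated": value("statistics/ratings/usersrated", int),
        "average_rating": value("statistics/ratings/average", float),
        "bayes_average_rating": value("statistics/ratings/bayesaverage", float),
        "average_weight": value("statistics/ratings/averageweight", float),
        "categories": links("boardgamecategory"),
        "mechanics": links("boardgamemechanic"),
    }


root = ET.fromstring(SAMPLE_XML.strip())
games = [parse_item(item) for item in root.findall("item")]
print(games[0]["name"], games[0]["mechanics"])
```

A list of such dictionaries can then be loaded into whatever table structure the analysis uses. Fields that are absent from an entry simply come back as None or as an empty list.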

I did the more advanced analysis with the scikit-learn machine learning library. The graphs were done using the matplotlib library. All this was done in the Anaconda development environment. Once I’ve cleaned up and documented the code, I’ll try to make each notebook available in the corresponding section.

Bryan Burgoyne