Skip to main content
  1. Projects/

Lego Set Sales

·918 words·5 mins

Maven Analytics recently hosted a competition to analyze a dataset of Lego releases stretching back to 1970. I’m not a Lego fanatic by any means, but it’s a fandom I’m more than happy to watch from afar, whether they’re stretching their bricks to absolute limit or setting up their ideal Lego studio. One video I’d highly recommend, and which has some particular resonance for this project, would be Slow Start’s documentary about Lego’s Bionicle line and how it possibly single-handedly saved the whole company:

Maven tasked us with creating an interactive dashboard, but since I’m not competing, I figure I’d instead just answer the research questions using Python and Excel. The questions are as follows:

  • How many LEGO sets have been released since 1970? Is there a noticable trend?
  • Is there a relationship between the price of a set and its number of pieces?
  • Which has been the most popular theme in each decade?
  • Are LEGO minifigures most closely tied to licensed sets?

Exploring the Data #

Like more datasets from Maven, the data is already in a pretty usable form. For the sake of completeness, I put together this data dictionary:

  • set_id: Official LEGO item number
  • name: Name of the LEGO set
  • year: Release year
  • theme: LEGO theme the set belongs to
  • subtheme: Subtheme within the theme
  • themeGroup: Overall group the theme belongs to
  • category: Type of set
  • pieces: Number of pieces in the set
  • minifigs: Number of mini figures included in the set
  • agerange_min: Minimum age recommended
  • US_retailPrice: US retail price at launch
  • bricksetURL: URL for the set on brickset.com
  • thumbnailURL: Small image of the set
  • imageURL: Full size image of the set

I set up a Jupyter notebook and put in the following commands to make sure that all the data is present.

import pandas as pd
import numpy as np

df = pd.read_csv("lego_sets.csv")

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18457 entries, 0 to 18456
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   set_id          18457 non-null  object 
 1   name            18457 non-null  object 
 2   year            18457 non-null  int64  
 3   theme           18457 non-null  object 
 4   subtheme        14901 non-null  object 
 5   themeGroup      18455 non-null  object 
 6   category        18457 non-null  object 
 7   pieces          14533 non-null  float64
 8   minifigs        8399 non-null   float64
 9   agerange_min    6787 non-null   float64
 10  US_retailPrice  6982 non-null   float64
 11  bricksetURL     18457 non-null  object 
 12  thumbnailURL    17451 non-null  object 
 13  imageURL        17451 non-null  object 
dtypes: float64(4), int64(1), object(9)
memory usage: 2.0+ MB

Research Questions #

How many LEGO sets have been released since 1970? Is there a noticable trend? #

According to the data, LEGO has released a total of 18,457 sets since 1970. These can be further broken down by category:

  • 12,757 Normal
  • 2,832 Gear
  • 1,094 Other
  • 631 Book
  • 578 Collection
  • 501 Extended
  • 64 Random

Releases By Year

Obviously the Normal sets far outnumber all the others combined. More interestingly, none of the other categories records more than a blip until 1998, when Gear begins taking off. It has been the second-best-selling category ever since with a brief exception between 2003 and 2005, when Collection took over. This tracks with the documentary above, which states that Lego had to make big pivots in the late 90s in order to stay relevant to the changing toy market.

Is there a relationship between the price of a set and its number of pieces? #

The correlation between retail price and the number of pieces is 0.867, which is indeed a very strong correlation.

As a sidenote, it’s interesting how low the correlation between year and retail price is; wouldn’t inflation track pretty well with yearly data? This result might be explained by the fact that reliable data on retail price doesn’t appear until 1996. There’s been less than 100% cumulative inflation since then, as opposed to the nearly 700% cumulative inflation since 1970. There’s also the possibility that the visible effects of inflation are diluted in the dataset due to LEGO offering a wider variety of products at different price points.

df.corr(method='pearson', numeric_only=True)
                    year    pieces  minifigs  agerange_min  US_retailPrice
year            1.000000  0.161708  0.045217      0.163378        0.129770
pieces          0.161708  1.000000  0.482633      0.636915        0.867433
minifigs        0.045217  0.482633  1.000000      0.346101        0.535155
agerange_min    0.163378  0.636915  0.346101      1.000000        0.587281
US_retailPrice  0.129770  0.867433  0.535155      0.587281        1.000000
df_retail = df[df["year"] > 1995]
df_retail.corr(method='pearson', numeric_only=True)
                    year    pieces  minifigs  agerange_min  US_retailPrice
year            1.000000  0.144281  0.049893      0.155055        0.129519
pieces          0.144281  1.000000  0.499715      0.637391        0.867426
minifigs        0.049893  0.499715  1.000000      0.344927        0.535155
agerange_min    0.155055  0.637391  0.344927      1.000000        0.587344
US_retailPrice  0.129519  0.867426  0.535155      0.587344        1.000000

There are several ways to convert years to decades, but I chose to use Excel to set up a new column called decade and input the following formula:

=(ROUNDDOWN(C2*0.1, 0))*10

Using a couple Pivot Tables, I was able to get the following:

title

Are LEGO minifigures most closely tied to licensed sets? #

The dataset indicates a very strong link between minifigures and licensed sets; not only do licensed sets have more minifigures than any other theme, but the the percentage of all minifigures that are found in licensed sets has risen from a little under 12% in 1999 to nearly half in 2022 and 2023. This makes sense, of course; what would a Hogwarts set be without Harry Potter, or a Death Star set without Darth Vader?

title

Further Research #

I’d like to return to this dataset soon to set up an interactive dashboard, perhaps with Tableau. I’d also like to dig in to the specfic years in which the original Bionicles line was running to gain further insights about the era.