Data Analysis

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)
df_general_cleaned <- read_csv("data/general_cleaned.csv")
Rows: 601 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): game1, genre
dbl (6): year, totalearn, offlineearn, percent, totallayer, tournament

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df_historical_cleaned <- read_csv("data/historical_cleaned.csv")
Rows: 9731 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): game2
dbl  (3): earn, player, tournament
date (1): date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data Analysis

Earning per Year

Q1 asks the earning per year. To answer this question, I first extracted the year from the date column in the dataset. Then, I grouped the data by year and calculated the total earnings for each year.

df_historical_cleaned |>
  mutate(year_total = year(date)) |>
  group_by(year_total) |>
  summarise(yearsum = sum(earn)) 
# A tibble: 26 × 2
   year_total  yearsum
        <dbl>    <dbl>
 1       1998  153800 
 2       1999  285319.
 3       2000  769783.
 4       2001  940232.
 5       2002 1042500.
 6       2003 1952859.
 7       2004 3081180.
 8       2005 4418475.
 9       2006 5787328.
10       2007 7628095.
# ℹ 16 more rows

Average earning & tournaments per month

Q2(1) asks for the average earning per month. To answer this question, I first extracted the month from the date column in the dataset. Then, I grouped the data by month and calculated the average earnings for each month.

df_historical_cleaned |>
  mutate(month = substr(date, 6, 7)) |>
  group_by(month) |>
  summarise(e_month_avg = mean(earn))
# A tibble: 12 × 2
   month e_month_avg
   <chr>       <dbl>
 1 01        101054.
 2 02         94802.
 3 03        128704.
 4 04        141726.
 5 05        181406.
 6 06        137008.
 7 07        232659.
 8 08        297631.
 9 09        143031.
10 10        230862.
11 11        195119.
12 12        236115.

Q2(2) asks for the average tournaments per month. To answer this question, I first extracted the month from the date column in the dataset. Then, I grouped the data by month and calculated the average tournaments for each month.

df_historical_cleaned |>
  mutate(month = substr(date, 6, 7)) |>
  group_by(month) |>
  summarise(t_month_avg = mean(tournament))
# A tibble: 12 × 2
   month t_month_avg
   <chr>       <dbl>
 1 01           5.90
 2 02           6.45
 3 03           6.94
 4 04           7.12
 5 05           7.29
 6 06           6.41
 7 07           6.93
 8 08           6.08
 9 09           6.34
10 10           6.60
11 11           6.84
12 12           6.75

The ten most profitable e-sport games

Q3 asks for the ten most profitable e-sport games. I first arranged the dataset in descending order based on the total earnings, which allowing me to sort the games by their profitability. Then I selected the top ten entries from the sorted dataset to identify the most profitable e-sport games.

top_10_earners <- df_general_cleaned |>
  arrange(desc(totalearn)) 
head(top_10_earners,n = 10)
# A tibble: 10 × 8
   game1          year genre totalearn offlineearn percent totallayer tournament
   <chr>         <dbl> <chr>     <dbl>       <dbl>   <dbl>      <dbl>      <dbl>
 1 Dota 2         2013 Mult…    3.60e8  313554838.   0.870       4980       1934
 2 Fortnite       2017 Batt…    1.91e8   56299885.   0.295       9535       2291
 3 Counter-Stri…  2012 Firs…    1.62e8  105130836.   0.647      16518       7033
 4 League of Le…  2009 Mult…    1.14e8   98546688.   0.868       9588       3000
 5 Arena of Val…  2015 Mult…    9.87e7   94528965.   0.957       2231        179
 6 PLAYERUNKNOW…  2017 Batt…    8.73e7   60665002.   0.695       4263        341
 7 PLAYERUNKNOW…  2017 Batt…    6.42e7   49370329.   0.769       3967        631
 8 Rainbow Six …  2015 Firs…    4.81e7   37241404.   0.774       3070        568
 9 StarCraft II   2010 Stra…    4.27e7   33672991.   0.789       2280       7510
10 Rocket League  2015 Spor…    4.23e7   17816800.   0.422       6884       4542

The revenue of the ten most profitable e-sport games changed per years

Q4 asks how the revenue of the ten most profitable e-sport games changed over time. I first extracted the year from the date column in the dataset. Then, I filtered the data to include only the top 10 games identified in Q3. After that, I calculated the total earnings for each game in each year.

df_historical_cleaned |>
  mutate(year = year(date)) |>
  filter(game2 %in% c("Dota 2", "Fortnite", "Counter-Strike: Global Offensive", 
                      "League of Legends", "Arena of Valor", 
                      "PLAYERUNKNOWN'S BATTLEGROUNDS Mobile", 
                      "PLAYERUNKNOWN’S BATTLEGROUNDS", 
                      "Rainbow Six Siege", "StarCraft II", "Rocket League")) |>
  group_by(year, game2) |>
  summarise(earn_sum = sum(earn)) 
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
# A tibble: 97 × 3
# Groups:   year [14]
    year game2                            earn_sum
   <dbl> <chr>                               <dbl>
 1  2010 League of Legends                  27943.
 2  2010 StarCraft II                      820839.
 3  2011 Dota 2                           1672768.
 4  2011 League of Legends                 527995.
 5  2011 StarCraft II                     3222428.
 6  2012 Counter-Strike: Global Offensive  222539.
 7  2012 Dota 2                           2087095.
 8  2012 League of Legends                4287337.
 9  2012 StarCraft II                     4237569.
10  2013 Counter-Strike: Global Offensive 1219039.
# ℹ 87 more rows

The average monthly revenue of the ten most profitable e-sport games

Q5 asks for the average monthly revenue of the ten most profitable e-sport games. I first extracted the month from the date column in the dataset. Then, I filtered the data to include only the top 10 games identified in Q3. After that, I grouped the data by month and game and calculated the average earnings for each game in each month.

df_historical_cleaned |>
  mutate(month = substr(date, 6, 7)) |>
  filter(game2 %in% c("Dota 2", "Fortnite", "Counter-Strike: Global Offensive", 
                      "League of Legends", "Arena of Valor", 
                      "PLAYERUNKNOWN'S BATTLEGROUNDS Mobile", 
                      "PLAYERUNKNOWN’S BATTLEGROUNDS", 
                      "Rainbow Six Siege", "StarCraft II", "Rocket League")) |>
  group_by(month, game2) |>
  summarise(month_earning = mean(earn)) 
`summarise()` has grouped output by 'month'. You can override using the
`.groups` argument.
# A tibble: 120 × 3
# Groups:   month [12]
   month game2                                month_earning
   <chr> <chr>                                        <dbl>
 1 01    Arena of Valor                             548178.
 2 01    Counter-Strike: Global Offensive           833441.
 3 01    Dota 2                                    1012297.
 4 01    Fortnite                                   447184.
 5 01    League of Legends                          167203.
 6 01    PLAYERUNKNOWN'S BATTLEGROUNDS Mobile      2230359.
 7 01    PLAYERUNKNOWN’S BATTLEGROUNDS              202073.
 8 01    Rainbow Six Siege                           85265.
 9 01    Rocket League                              172484.
10 01    StarCraft II                               144511.
# ℹ 110 more rows

The earn per tournament trend for the ten most profitable e-sport games through 2017 to 2023

Q6 asks about the earnings per tournament trend for the ten most profitable e-sport games from 2017 to 2023. I first filtered the dataset to include only the specified top 10 games. Then, I applied another filter to restrict the data to the years between 2017 and 2023. After that, I calculated the earnings per tournament for each game by dividing the total earnings by the number of tournaments.

df_historical_cleaned |>
  filter(game2 %in% c("Dota 2", "Fortnite", "Counter-Strike: Global Offensive", 
                      "League of Legends", "Arena of Valor", 
                      "PLAYERUNKNOWN'S BATTLEGROUNDS Mobile", 
                      "PLAYERUNKNOWN’S BATTLEGROUNDS", 
                      "Rainbow Six Siege", "StarCraft II", "Rocket League")) |>
  filter(substr(date, 1, 4) %in% c("2017", "2018", "2019", "2020", "2021", "2022", "2023")) |>
  group_by(date, game2) |>
  summarise(t_earning = earn/tournament)
`summarise()` has grouped output by 'date'. You can override using the
`.groups` argument.
# A tibble: 764 × 3
# Groups:   date [84]
   date       game2                            t_earning
   <date>     <chr>                                <dbl>
 1 2017-01-01 Counter-Strike: Global Offensive    84323.
 2 2017-01-01 Dota 2                             176639.
 3 2017-01-01 League of Legends                   11979.
 4 2017-01-01 Rainbow Six Siege                     100 
 5 2017-01-01 Rocket League                         223.
 6 2017-01-01 StarCraft II                         9723.
 7 2017-02-01 Counter-Strike: Global Offensive    13988.
 8 2017-02-01 Dota 2                              48536.
 9 2017-02-01 League of Legends                   20416.
10 2017-02-01 Rainbow Six Siege                  100000 
# ℹ 754 more rows

Plot

library(plotly)

Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':

    last_plot
The following object is masked from 'package:stats':

    filter
The following object is masked from 'package:graphics':

    layout

Earning per Year

# Create the plot and assign it to a variable
df_historical_cleaned |>
  mutate(year_total = year(date)) |>
  filter(year_total != 2024) |>
  group_by(year_total) |>
  summarise(yearsum = sum(earn) / 1000000) |>
  ggplot(aes(x = year_total, y = yearsum, fill = as.factor(year_total))) +
  geom_col() +
  labs(title = "Earning per Year (million)",
       x = "Year",
       y = "Earning") +
  theme_minimal(base_size = 15) +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5))

Average Earning & Tournaments per month

df_historical_cleaned |>
  filter(date < "2024-01-01") |>
  mutate(month = substr(date, 6, 7)) |>
  group_by(month) |>
  summarise(e_month_avg = mean(earn) / 100000, .groups = 'drop') |>
  ggplot(aes(x = month, y = e_month_avg, fill = month)) +
  geom_col() +
  labs(title = "Average Earning per Month (Million)",
       x = "Month",
       y = "Average Earning") +
  theme_minimal(base_size = 15) +
  theme(legend.position = "none",
        plot.title = element_text(size = 15, hjust = 0.5),
        axis.text.x = element_text(size = 9),  
        axis.text.y = element_text(size = 9))

df_historical_cleaned |>
  filter(date < "2024-01-01") |>
  mutate(month = substr(date, 6, 7)) |>
  group_by(month) |>
  summarise(t_month_avg = mean(tournament), .groups = 'drop') |>
  ggplot(aes(x = month, y = t_month_avg, fill = month)) +
  geom_col() +
  labs(title = "Average Tournaments per Month",
       x = "Month",
       y = "Average Tournaments") +
  theme_minimal(base_size = 15) +
  theme(legend.position = "none",
        plot.title = element_text(size = 15, hjust = 0.5),
        axis.text.x = element_text(size = 9),  
        axis.text.y = element_text(size = 9))

The ten most profitable e-sport games

# Specifies the factor variable to be modified, then keeps the top 10 levels based on the earn, and makes the levels that are not in the top 10 combined into a new level called "Other".
df_general_cleaned |>
  mutate(game = fct_lump_n(f = game1, n = 10, w = totalearn, other_level = "Other")) |>
  group_by(game) |>
  summarise(earnings = sum(totalearn), .groups = 'drop') |>
  mutate(proportion = earnings / sum(earnings), 
         Game = fct_reorder(.f = game, .x = proportion, .fun = max, .desc = TRUE),
         Game = fct_relevel(.f = game, "Other", after = Inf)) |>
  ggplot(aes(x = "", y = proportion, fill = game)) + # 
  geom_col(width = 1, color = "black", alpha = 0.85) +
  theme_minimal(base_size = 15) +  
  theme(legend.position = "right", 
        legend.title = element_text(face = "bold",size = 10),
        axis.text = element_text(face = "bold",size = 9),
        plot.background = element_blank(),
        legend.text = element_text(size = 8),
        panel.grid = element_blank()) + 
  labs(title = "Percentage of Earnings by Game", 
       y = element_blank(), 
       x = element_blank(),
       fill = "Game Title") +
  geom_text(aes(label = ifelse(proportion > 0.05, paste0(round(proportion, 3) * 100, "%"), "")),
            position = position_stack(vjust = 0.5), size = 3) 

The revenue of the ten most profitable e-sport games changed per years

The ten most profitable e-sport games: Dota 2, Fortnite, Counter-Strike: Global Offensive, League of Legends, Arena of Valor, PLAYERUNKNOWN’S BATTLEGROUNDS Mobile, PLAYERUNKNOWN’S BATTLEGROUNDS, Rainbow Six Siege, StarCraft II, Rocket League

df_historical_cleaned |>
  mutate(year = year(date)) |>
  filter(year!= 2024) |>
  filter(game2 %in% c("Dota 2", "Fortnite", "Counter-Strike: Global Offensive", 
                      "League of Legends", "Arena of Valor", 
                      "PLAYERUNKNOWN'S BATTLEGROUNDS Mobile", 
                      "PLAYERUNKNOWN’S BATTLEGROUNDS", 
                      "Rainbow Six Siege", "StarCraft II", "Rocket League")) |>
  group_by(year, game2) |>
  summarise(earn_sum = sum(earn)/100000) |>
  ggplot(aes(x = year, y = earn_sum, color = game2)) +
  geom_line() +
  geom_point() +
  labs(title = "Revenue for the ten most profitable e-sport games over time(million)",
       x = "Year",
       y = "Revenue",
       color = "Game") +
  theme_bw() +
  theme(legend.position = "right",
        legend.text = element_text(size = 4),
        plot.title = element_text(size = 12))
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.

The average monthly revenue of the ten most profitable e-sport games

df_historical_cleaned |>
  mutate(month = substr(date, 6, 7)) |>
  filter(game2 %in% c("Dota 2", "Fortnite", "Counter-Strike: Global Offensive", 
                      "League of Legends", "Arena of Valor", 
                      "PLAYERUNKNOWN'S BATTLEGROUNDS Mobile", 
                      "PLAYERUNKNOWN’S BATTLEGROUNDS", 
                      "Rainbow Six Siege", "StarCraft II", "Rocket League")) |>
  group_by(month, game2) |>
  summarise(month_earning = mean(earn)/1000000) |>
  ggplot(aes(x = month, y = month_earning, fill = game2)) +
  geom_col() +
  facet_wrap(~game2) +
  labs(title = "The Monthly Earnings Trend for the Ten Most Profitable E-Sport Games (Million)",
       x = "Month",
       y = "Tournament Earnings",
       color = "Game") +
  theme_minimal(base_size = 15) +
  theme(axis.text.x = element_text(size = 6, angle = 45, hjust = 1),
        axis.text.y = element_text(size = 9),
        plot.title = element_text(size = 11, hjust = 0.5),
        strip.text = element_text(size = 5), 
        legend.position = "none")
`summarise()` has grouped output by 'month'. You can override using the
`.groups` argument.

The earn per tournament trend for the ten most profitable e-sport games through 2017 to 2023

df_historical_cleaned |>
  filter(game2 %in% c("Dota 2", "Fortnite", "Counter-Strike: Global Offensive", 
                      "League of Legends", "Arena of Valor", 
                      "PLAYERUNKNOWN'S BATTLEGROUNDS Mobile", 
                      "PLAYERUNKNOWN’S BATTLEGROUNDS", 
                      "Rainbow Six Siege", "StarCraft II", "Rocket League")) |>
  filter(substr(date, 1, 4) %in% c("2017", "2018", "2019", "2020", "2021", "2022", "2023")) |>
  group_by(date, game2) |>
  summarise(t_earning = earn/100000*tournament) |>
  ggplot(aes(x = date, y = t_earning, color = game2)) +
  geom_line() +
  facet_wrap(~game2) +
  labs(title = "The earn per tournament trend for the ten most profitable e-sport games(million)",
       x = "Date",
       y = "Tournamente earning",
       color = "Game") +
  theme_bw() +
  theme(axis.text.x = element_text(size = 6, angle = 45, hjust = 1),
        axis.text.y = element_text(size = 9),
        plot.title = element_text(size = 11, hjust = 0.5),
        strip.text = element_text(size = 5), 
        legend.position = "none")
`summarise()` has grouped output by 'date'. You can override using the
`.groups` argument.