Andrii Kondratok

Researching the specifics of product X from the Genesis company dataset, conducting a superficial analysis to provide advice on further steps for the product


PS: Watch html at 75% scale of browser please, sizing tableau dashboards is painfull

📊 Portfolio Theme & Introduction

PS: I understand that the data was most likely generated by AI, but this does not negate the fact that AI could have used other datasets in its memory as examples and relied on real-life companies, so we cannot completely dismiss the plausibility of the dataset.

The main questions of my research are as follows: What exactly does product X do, what are its specific features? What steps should product X take in the future to maximize its key metrics and profits? To this end, I will conduct a superficial analysis of the product to get an overall picture, and delve a little deeper into individual parts to provide data-driven advice.

Visualization 1: Dashboard "Analysis of audience size, retention rate, and RFM analysis worldwide for product X"

Main questions:
  • Which countries use our product the most? This question should show us where we should invest the most in marketing, and which countries have high potential due to their population size but do not use the product.
  • How interested are users in our product, and how many of them actually continue to use it? If our retention rate is low, we should consider adding gamification or other tools to increase audience interest, otherwise our CAC will be too high, considering the users who have gone beyond registration.
  • What is the distribution of the audience across RFM analysis segments? What is the revenue for each segment? Where is the greatest contrast and large audience coverage area? It is necessary to invest in transferring the audience between these two segments and adding features specifically for them.

The answers to these questions will enable you to optimize marketing campaign costs and allocate the savings to converting your audience across RFM segments using additional features to maximize profits.

  • Why did you choose this chart type? Map is an easy choice to show the amount of users per country, since it uses visual memory of a user, and is easier than reading names of 100+ countries on a bar chart. Linear chart is mainly used to display trend over time, and that's exactly what retention rate is. Stacked bar chart in RFM analysis was used to replace area chart which showed the data incorectly, and helps with visualizing the quantitive amount of a metric at every date, by every segment.
  • What visual encoding choices did you make (position, color, size, etc.)? In the map I used color encoding. The colors were chosen like that since we need both precise amount of users (so we can't use categorical colors pallete, which would require previous split into categories), but clearly show categories of countries: "very small amount of users", "medium amount", "the most". In the RFM analysis I also chose colors, but this time I needed categorical pallete to distinguish the RFM segments. Retention rate doesn't need any encodings
  • What alternatives did you consider and why did you choose this approach? I also considered using area chart for RFM, but the problem was that it showed "phantom payments". When a segment made payments with the pause of 3 days, it still connected these points through dates when no payments were made. It was innacurete. With bar chart such problem is abscent.
  • What are the strengths and tradeoffs of your choice? We can still clearly see the contrast between revenue and amount of users, but now it is precisely accurate

Main insights are: our main audience is Ukraine and US. Peru, Brazil and South Africa are close. We may look at the Asian region. China and India with their population are an easu choice for the next marketing campaign.
Our retention rate by activity is really low, 10% would be a healthy percentage. We need to work on maintaining the audience and look at the funnel to understand where we are loosing the most people.
Looking at our RFM analysis, we can understang that the most audience is still at the "New Customer" level, however, potential loyalists are making almost the same revenue. Of cource the contrast is bigger on champions or loyals, but if we invest on them -- we are covering very small amount of people. So my advice is to first of all make a customer potential loyalist. Maybe some bonuses from the first 2-3 payments.

📂 Data Source

Genesis Dataset from Product analytics course

[dataset is in the zip file]

Visualization 2: Gender and age distribution by country, and the dependence of the average check on gender and age distribution

Main questions:
  • Which age group and gender should we focus on in our marketing campaign?
  • Is there a correlation between % of people over 35 years and the AOV?

The answers to these two questions will optimize our marketing campaign once again, and give us interesting insights about where should we develop our product further, depending on our correlation.

  • Why did you choose this chart type? Map is an easy choice to show the AOV per country, since it uses visual memory of a user, and is easier than reading names of 100+ countries on a bar chart. Scatter plot is mainly used for showing trends and correlation, which alligns perfectly with our task. Pie chart is used to show the percentage out of 100 for something. That's why it shows gender distribution. But why isn't it used in Age distribution? Because age groups are ordinal data, which is better be shown in the right order. Bar chart helps with that.
  • What visual encoding choices did you make (position, color, size, etc.)? In the map I used color encoding. The colors were chosen like that since we need both precise AOV (so we can't use categorical colors pallete, which would require previous split into categories), but clearly show categories of countries: "very high AOV", "medium AOV", "small AOV". In the age distribution I also used color to distinguish genders, and used categorical palette. In the age group I chose to use sequential since it's not completely different categories, more like shades of one variable - age.
  • What alternatives did you consider and why did you choose this approach? I also considered using pie chart for age distributoin, but text settings, color palette and easy-to-understand order would be hard to set up.
  • What are the strengths and tradeoffs of your choice? We can easily see the correlation, set the minimum amount of orders to take into account, our colors on the map are adjusting to this filter, our distibutions are visually easu to understand.
Absolute most of our audience is male, and main age group is 35-44. Depending on the minimum order we use, the correlation differs, but overally we can see that it exists, because older audience has more money and makes most of our audience. So our main target audience is men over 35.

📂 Data Source

Genesis Dataset from Product analytics course

[dataset is in the zip file]

Visualization 3: Churn Rate & Session Length Heatmap

Main questions:
  • What is our situation with the churn rate and how it's been over time?
  • How does the churn rate affect average session duration?
  • Were there any anomallies over time?

The answers to these questions will help us see the current situation of a product in terms of how many people are leaving the product, the trend over time, and anomalies which may tell us the story of latest updates of the product.

  • Why did you choose this chart type? Lina chart is the best decision to show linear trends over time. Heatmap only shows top 10 countries by the total revenue to show the most statisticlly significant ststistics. And heatmap is used because line chart with 10 lines would be hard to understand.
  • What visual encoding choices did you make (position, color, size, etc.)? Heatmap is encoded with sequential colors so the user can clearly see the differents in shades over time.
  • What alternatives did you consider and why did you choose this approach? I also considered using line chart for session duration, but as I mentioned, it would overload the vizualization, and the "mean line" from these 10 countries wouldn't let us look at each country separately, because maybe the problem was in translation to a specific language or in any action that took place only in one country.
  • What are the strengths and tradeoffs of your choice? We can easily see the change in shades with different churn rate and make our conclusions by the change in avg. session duration per country.
Animated Churn Rate and Session Length Heatmap
Near April 19 there is a peak of churn rate. And right at this moment we can see that all the shades on the lower graph transform into light yellow, almost identical. What it means? Product X probably added a feature or changed its design to the point that many users just left the product. However, the churn rate in real life has the healthy level. After deleting the feature the churn rate fell dramatically, but the session duration didn't change. What it means? Most of users who left the product are the "New users", and only more loyal ones stayed. So churn rate is low, because loyal customers won't leave over a small change in design. But they are enjoying the product less since the session duration fell. And after we deleted the change, we lost a part of their trust, now we must return it with some bonuses. At the end churn rate is rising as well as session duration due to the fact that we are fetting more new users. Some leave, some stay for a while checking all features.

📂 Data Source

Genesis Dataset from Product analytics course

[dataset is in the zip file]

R Code Used for Visualization 3


library(tidyverse)
library(lubridate)
library(ggplot2)
library(gganimate)
library(gifski)
library(patchwork)
activities <- read_csv("C:/Users/Asus/Downloads/activities.csv")
activity_type_dict <- read_csv("C:/Users/Asus/Downloads/activity_type_dict.csv")
country_info_dict <- read_csv("C:/Users/Asus/Downloads/country_info_dict.csv")
feature_type_dict <- read_csv("C:/Users/Asus/Downloads/feature_type_dict.csv")
payments <- read_csv("C:/Users/Asus/Downloads/payments.csv")
registrations <- read_csv("C:/Users/Asus/Downloads/registrations.csv")
sessions <- read_csv("C:/Users/Asus/Downloads/sessions.csv")

head(activities)
head(sessions)
head(registrations)

data_merged <- registrations |>
  left_join(sessions, by = "user_id")

head(data_merged)
data_merged <- data_merged |>
  mutate(session_start_dt  = as.Date(session_start_dt ))

data_merged <- data_merged |>
  group_by(user_id) |>
  arrange(user_id, session_start_dt) |>
  mutate(
    diff_days = c(NA, diff(session_start_dt)),
    churn_flag = if_else(is.na(diff_days) | diff_days > 4, 1, 0)
  ) |>
  ungroup()

churn_rate <- data_merged |>
  group_by(session_start_dt) |>
  summarise(
    churn_rate = mean(churn_flag, na.rm = TRUE)
  )

head(churn_rate)


churn_rate <- data_merged |>
  group_by(session_start_dt) |>
  summarise(churn_rate = mean(churn_flag, na.rm = TRUE)) |>
  arrange(session_start_dt) |>
  mutate(frame_id = row_number())

animated_churn_rate <- ggplot(churn_rate, aes(x = session_start_dt, y = churn_rate)) +
  geom_line(aes(group = 1), color = "blue", size = 1) +
  geom_point(color = "darkblue", size = 3) +
  labs(
    title = "Churn Rate по датах сесій",
    x = "Дата сесії",
    y = "Churn Rate"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  transition_reveal(frame_id) + 
  ease_aes('linear')

anim <- animate(
  animated_churn_rate, 
  renderer = gifski_renderer(),
  nframes = 100,
  fps = 10,
  width = 800,
  height = 600,
  end_pause = 10
)

anim

avg_session_length <- data_merged |>
  filter(!is.na(duration)) |>
  group_by(session_start_dt) |>
  summarise(avg_session_length = mean(duration, na.rm = TRUE)) |>
  arrange(session_start_dt) |>
  mutate(frame_id = row_number())

p2 <- ggplot(avg_session_length, aes(x = session_start_dt, y = avg_session_length)) +
  geom_line(aes(group = 1), color = "darkgreen", size = 1) +
  geom_point(color = "green", size = 3) +
  labs(
    title = "Середня довжина сесії по датах",
    x = "Дата сесії",
    y = "Середня довжина (хв)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  transition_reveal(frame_id) +
  ease_aes('linear')

combined_plot <- anim / p2

anim_combined <- animate(
  combined_plot, 
  renderer = gifski_renderer(),
  nframes = 100,
  fps = 10,
  width = 800,
  height = 1000,
  end_pause = 10
)

anim_combined







churn_rate_long <- churn_rate |>
  mutate(metric = "Churn Rate",
         value = churn_rate)

avg_session_long <- avg_session_length |>
  mutate(metric = "Середня довжина сесії (хв)",
         value = avg_session_length)

combined_data <- bind_rows(
  churn_rate_long |> select(session_start_dt, frame_id, metric, value),
  avg_session_long |> select(session_start_dt, frame_id, metric, value)
)

animated_combined <- ggplot(combined_data, aes(x = session_start_dt, y = value)) +
  geom_line(aes(group = 1, color = metric), size = 1) +
  geom_point(aes(color = metric), size = 3) +
  facet_wrap(~ metric, ncol = 1, scales = "free_y") + 
  labs(
    title = "Метрики по датах сесій",
    x = "Дата сесії",
    y = "Значення"
  ) +
  scale_color_manual(values = c("Churn Rate" = "blue", 
                                "Середня довжина сесії (хв)" = "darkgreen")) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "none"
  ) +
  transition_reveal(frame_id) +
  ease_aes('linear')

anim_combined <- animate(
  animated_combined, 
  renderer = gifski_renderer(),
  nframes = 100,
  fps = 10,
  width = 1280,
  height = 1000,
  end_pause = 10
)
library(av)
anim_combined
anim_save("combined_animation.gif", animation = anim_combined, path = "C:/Users/Asus/Downloads", renderer = gifski_renderer())



library(magick)
p1 <- ggplot(churn_rate, aes(x = session_start_dt, y = churn_rate)) +
  geom_line(aes(group = 1), color = "blue", size = 1.5) +
  geom_point(color = "darkblue", size = 4) +
  labs(
    title = "Churn Rate по датах сесій",
    x = "Дата сесії",
    y = "Churn Rate"
  ) +
  theme_minimal(base_size = 16) +
  theme(
    plot.title = element_text(size = 22, face = "bold", hjust = 0.5),
    axis.title = element_text(size = 18, face = "bold"),
    axis.text.x = element_text(size = 14, angle = 45, hjust = 1),
    axis.text.y = element_text(size = 14),
    plot.margin = margin(20, 20, 10, 20)
  ) +
  transition_reveal(frame_id) +
  ease_aes('linear')

top_countries_revenue <- data_merged |>
  left_join(payments, by = "user_id") |>
  filter(!is.na(amount_usd)) |>
  group_by(country_code) |>
  summarise(total_revenue = sum(amount_usd, na.rm = TRUE)) |>
  arrange(desc(total_revenue)) |>
  slice(1:10) |>
  pull(country_code)

heatmap_data <- data_merged |>
  filter(country_code %in% top_countries_revenue,
         !is.na(duration))  |> 
  group_by(country_code, session_start_dt) |>
  summarise(avg_session_length = mean(duration, na.rm = TRUE),
            .groups = 'drop')

p2 <- ggplot(heatmap_data, aes(x = session_start_dt, y = country_code, fill = avg_session_length)) +
  geom_tile(color = "white", size = 0.5) +
  scale_fill_gradient2(low = "lightblue", mid = "yellow", high = "red", 
                       midpoint = median(heatmap_data$avg_session_length, na.rm = TRUE),
                       name = "Хвилини") +
  labs(
    title = "Середня довжина сесії: Топ-10 країн за Revenue",
    x = "Дата сесії",
    y = "Країна"
  ) +
  theme_minimal(base_size = 16) +
  theme(
    plot.title = element_text(size = 22, face = "bold", hjust = 0.5),
    axis.title = element_text(size = 18, face = "bold"),
    axis.text.x = element_text(size = 14, angle = 45, hjust = 1),
    axis.text.y = element_text(size = 14, face = "bold"),
    legend.title = element_text(size = 16, face = "bold"),
    legend.text = element_text(size = 14),
    plot.margin = margin(10, 20, 20, 20)
  ) +
  transition_time(session_start_dt) +
  ease_aes('linear')

anim1 <- animate(
  p1, 
  renderer = gifski_renderer(),
  nframes = 100,
  fps = 10,
  width = 1000,
  height = 500,
  res = 120
)

anim2 <- animate(
  p2, 
  renderer = gifski_renderer(),
  nframes = 100,
  fps = 10,
  width = 1000,
  height = 600,
  res = 120
)

anim_save("temp_churn.gif", anim1)
anim_save("temp_heatmap.gif", anim2)

churn_gif <- image_read("temp_churn.gif")
heatmap_gif <- image_read("temp_heatmap.gif")

combined_gif <- image_append(c(churn_gif[1], heatmap_gif[1]), stack = TRUE)
for(i in 2:100){
  combined <- image_append(c(churn_gif[i], heatmap_gif[i]), stack = TRUE)
  combined_gif <- c(combined_gif, combined)
}

image_write(combined_gif, "combined_churn_heatmap.gif")

combined_gif

file.remove("temp_churn.gif")
file.remove("temp_heatmap.gif")




n_frames_churn <- length(churn_gif)
n_frames_heatmap <- length(heatmap_gif)
n_frames <- min(n_frames_churn, n_frames_heatmap)

cat("Кількість кадрів у churn_gif:", n_frames_churn, "\n")
cat("Кількість кадрів у heatmap_gif:", n_frames_heatmap, "\n")
cat("Використовуємо кадрів:", n_frames, "\n")

combined_gif <- image_append(c(churn_gif[1], heatmap_gif[1]), stack = TRUE)

for(i in 2:n_frames){
  combined <- image_append(c(churn_gif[i], heatmap_gif[i]), stack = TRUE)
  combined_gif <- c(combined_gif, combined)
}

image_write(combined_gif, path = "C:/Users/Asus/Downloads/combined_churn_heatmap1.gif")

combined_gif

file.remove("temp_churn.gif")
file.remove("temp_heatmap.gif")

    

🔎 Final Conclusions

Looking at all the data together, Product X is best described as a streaming service.
Activities include likes, comments, and at the same time, people can pay (different amount of money), and get different statuses (vip, premium). It looks like twitch (but without live streams, because then comments would make up most of activities) with the internal contribution function (bits are horrible)
The audience is clearly male-dominated, with the strongest group being men aged 35+. Older users spend more, stay longer, and drive most of the revenue.
RFM results show that many users are still new, while potential loyalists already bring in almost as much revenue as top segments. This means the biggest opportunity is not chasing new users, but turning new users into regular viewers.
Product lost a big part of the trust at April 19, and now we have to rebuild it again. Very important: product must hold A/B test before adding a new feature or change design, to avoid such mistakes.
Taken together, the data strongly suggests that Product X is a streaming service focused on topics primarily interesting to men. The smartest next steps are to protect trust, hold A/B tests to avoid disruptive changes, and invest in content and incentives that move users toward habitual use. Marketing campaigns must be focused on Asian region, US, Ukraine, and Latin America (most potential + already make up most of the audience).