Welcome to my Bellabeat data analysis case study! In this project, I took on the role of a junior data analyst at Bellabeat, a high-tech company revolutionizing women’s wellness through products like the Leaf tracker and Ivy+ wearable. I collaborated with cross-functional teams, analyzed real-world health data, and followed the data analysis process-ask, prepare, process, analyze, share, and act-to uncover actionable insights for improving user engagement and product design.
Using Bellabeat’s ecosystem of smart jewelry, hydration trackers, and app-based wellness tools as my focus, I structured my workflow around the Case Study Roadmap tables. These kept me aligned with key tasks, from defining business questions about user activity patterns to visualizing how menstrual cycle data could inform personalized health recommendations.
By the end of this project, I developed a portfolio-ready case study that demonstrates my ability to transform raw data into strategic insights. I’ve included detailed documentation of my process, which I now reference regularly to showcase my analytical skills to potential employers. This case study not only highlights my technical expertise but also my understanding of how data can empower women’s health-a mission central to Bellabeat’s work.
As a junior data analyst on Bellabeat’s marketing team, I took on a critical project to unlock growth opportunities for our health-focused smart devices. Bellabeat, a leader in women’s wellness tech, already had a strong foothold with products like the Leaf Urban (a stress-predicting smart necklace) and Spring (a hydration tracker), but Urška Sršen, our cofounder and Chief Creative Officer, challenged me to analyze consumer smart device data to identify untapped potential in the global market.
Urška Sršen and Sando Mur founded Bellabeat, a high-tech company that manufactures health-focused smart products. Sršen used her background as an artist to develop beautifully designed technology that informs and inspires women around the world. Collecting data on activity, sleep, stress, and reproductive health has allowed Bellabeat to empower women with knowledge about their own health and habits. Since it was founded in 2013, Bellabeat has grown rapidly and quickly positioned itself as a tech-driven wellness company for women.
By 2016, Bellabeat had opened offices around the world and launched multiple products. Bellabeat products became available through a growing number of online retailers in addition to their own e-commerce channel on their website. The company has invested in traditional advertising media, such as radio, out-of-home billboards, print, and television, but focuses on digital marketing extensively. Bellabeat invests year-round in Google Search, maintaining active Facebook and Instagram pages, and consistently engages consumers on Twitter. Additionally, Bellabeat runs video ads on Youtube and display ads on the Google Display Network to support campaigns around key marketing dates.
Sršen knows that an analysis of Bellabeat’s available consumer data would reveal more opportunities for growth. She has asked the marketing analytics team to focus on a Bellabeat product and analyze smart device usage data in order to gain insight into how people are already using their smart devices. Then, using this information, she would like high-level recommendations for how these trends can inform Bellabeat marketing strategy.
The business task was to analyse smart device usage data in order to
gain insights into how consumers use non-Bellabeat smart devices. Then
selecting one Bellabeat product to apply these insights to in the
presentation. The business questions are:
1. What are some trends in smart device usage?
2. How could these trends apply to Bellabeat
customers?
3. How could these trends help influence Bellabeat marketing
strategy?
• Urška Sršen: Bellabeat cofounder and Chief Creative Officer
• Sando Mur: Cofounder, Mathematician, Key member of the executive
team
• marketing analytics team: Responsible for collecting, analysing and
reporting data to guide marketing strategy. I’m a junior data analyst in
the team.
• Bellabeat app: The app provides users with data
about their activity, sleep, stress, menstrual cycle and mindfulness
habits and helps people better understand their heath and habits. The
app connects to their line of smart wellness products.
• Leaf: a wellness tracker can be worn as a bracelet,
necklace and clip. Leaf can connect with Bellabeat app to track
activity, sleep and stress.
• Time: A wellness watch can track activity, sleep, and
stress and connect with Bellabeat app. A watch provides you with
insights about your daily wellness.
• Spring: A water bottle tracks daily water intake to
ensure you are hydrated enough daily by applying the state-of-art smart
technology. Spring can connect with the app and track your hydration
level.
• Bellabeat membership: a subscription-based program
for users 24/7 access to fully personalized guidance on nutrition,
activity, sleep, health and beauty, and mindfulness based on their
lifestyle and goals.
The data for this analysis is the Fitbit Fitness Tracker Data stored
on Kaggle and made available by Mobius. This dataset is under CC0:
Public Domain license meaning the creator has waived his right to the
work under copyright law.
Visit fitbit
dataset on Kaggle (CC0: Public Domain, dataset made available
through Mobius)
This dataset contains personal health data from 30 fitbit users. All users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes data about daily activity, steps and heart rate that can be used to explore users’ habits. Totally, I downloaded 18 CSV files and stored them properly.
This data set contains a personal fitness tracker from thirty fitbit users. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits. It was generated via a survey through Amazon Mechanical Turk between 12.03.2016 and 12.05.2016. Thirty eligible Fitbit users consented to submit their tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. Variation between output represents different Fitbit trackers and individual tracking behaviors/preferences.
The dataset has the following 18 files in .csv format.
• dailyActivity_merged.csv
• dailyCalories_merged.csv
• dailyIntensities_merged.csv
• dailySteps_merged.csv
• heartrate_seconds_merged.csv
• hourlyCalories_merged.csv
• hourlyIntensities_merged.csv
• hourlySteps_merged.csv
• minuteCaloriesNarrow_merged.csv
• minuteCaloriesWide_merged.csv
• minuteIntensitiesNarrow_merged.csv
• minuteIntensitiesWide_merged.csv
• minuteMETsNarrow_merged.csv
• minuteSleep_merged.csv
• minuteStepsNarrow_merged.csv
• minuteStepsWide_merged.csv
• sleepDay_merged.csv
• weightLogInfo_merged
• Sample size: the dataset collects the information from 30
individuals, which is a small sample size to determine behavior patterns
of a population of people who use smart devices
• The absence of essential characteristics of the participants might
have a significant impact on the data collected, such as gender, age,
and lifestyle
• Third-party data collected using Amazon Mechanical Turk.
• The data was collected six years ago. It is not as current as it could
be for the analysis.
• Reliable: low — the data collected from users
without demographic information
• Originality: low — the data was collected from
third-party Amazon Mechanical Turk
• Comprehensive: high — the data contained personal
health data which allowed me to answer business questions
• Current: low — the respondents were generated during
03.12.2016–05.12.2016.
• Cited: high — the data source was
well-documented.
R Studio was used to complete this analysis because it has many excellent data processing and visualization features.
Install packages
# Package names
packages <- c(
"tidyverse",
"here",
"knitr", "rmarkdown",
"ggplot2", "dplyr", "tidyr", "janitor",
"lubridate",
"todor","lintr",
"DT", "kableExtra",
"roxygen2", "testthat",
"ggpubr"
)
#Install packages not yet installed
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
install.packages(packages[!installed_packages])
}
# Packages loading
lapply(packages, library, character.only = TRUE) |>
invisible()
Set Working Directory
getwd()
Import Data
For this analysis, I imported three key datasets:
• dailyActivity_merged.csv
• sleepDay_merged.csv
• hourlySteps_merged.csv
Daily_Activity <- read.csv("dailyActivity_merged.csv", header = TRUE, sep = ",")
Daily_Sleep <- read.csv("sleepDay_merged.csv", header = TRUE, sep = ",")
Hourly_Steps <- read.csv("hourlySteps_merged.csv", header = TRUE, sep = ",")
Preview and Understand Data Structure
head(Daily_Activity)
str(Daily_Activity)
head(Daily_Sleep)
str(Daily_Sleep)
head(Hourly_Steps)
str(Hourly_Steps)
• Duplicate and Missing Entries:
The datasets contain duplicate records and missing (NA) values, which
can lead to inaccurate analysis if not addressed.
• Inconsistent Date and Time Formats:
Date and time columns appear in multiple formats across datasets, making
it difficult to merge and compare records without standardization.
• Mismatched Data Types for Primary Keys:
Key columns used for merging datasets (such as id and date fields) have
inconsistent data types (e.g., character vs. Date), which can cause
issues during data integration.
These issues must be resolved during the data preparation phase to ensure the reliability and accuracy of subsequent analyses.
Identify and Remove Duplicates & Missing Values
sum(duplicated(Daily_Activity))
## [1] 0
sum(duplicated(Daily_Sleep))
## [1] 3
sum(duplicated(Hourly_Steps))
## [1] 0
Daily_Activity <- Daily_Activity %>%
distinct() %>%
drop_na()
Daily_Sleep <- Daily_Sleep %>%
distinct() %>%
drop_na()
Hourly_Steps <- Hourly_Steps %>%
distinct() %>%
drop_na()
Standardize Column Names
Convert all column names to lowercase for consistency and easier data manipulation.
clean_names(Daily_Activity)
Daily_Activity <- rename_with(Daily_Activity, tolower)
clean_names(Daily_Sleep)
Daily_Sleep <- rename_with(Daily_Sleep, tolower)
clean_names(Hourly_Steps)
Hourly_Steps <- rename_with(Hourly_Steps, tolower)
Format Date and Time Columns
• Convert date columns to a standard format in each dataset.
• Parse and format time columns as required for later analysis.
Daily_Activity <- Daily_Activity %>%
mutate(activitydate = format(parse_date_time(activitydate, orders = c("mdy", "mdY")), "%Y-%m-%d"))
Daily_Sleep <- Daily_Sleep %>%
mutate(
sleepday = as.Date(
parse_date_time(sleepday, orders = c("mdy IMS p", "mdy HM"))
)
)
Hourly_Steps <- Hourly_Steps %>%
mutate(
activityhour = as.POSIXct(activityhour, format = "%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone()))
Prepare for Merging
• Rename columns as needed for consistency (e.g., ‘sleepday’ to ‘activitydate’).
Daily_Sleep <- Daily_Sleep %>%
rename(activitydate = sleepday)
• Ensure key columns (‘id’ and ‘activitydate’) have the same data type in both datasets.
str(Daily_Activity)
## 'data.frame': 940 obs. of 15 variables:
## $ id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ activitydate : chr "2016-04-12" "2016-04-13" "2016-04-14" "2016-04-15" ...
## $ totalsteps : int 13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
## $ totaldistance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ trackerdistance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ loggedactivitiesdistance: num 0 0 0 0 0 0 0 0 0 0 ...
## $ veryactivedistance : num 1.88 1.57 2.44 2.14 2.71 ...
## $ moderatelyactivedistance: num 0.55 0.69 0.4 1.26 0.41 ...
## $ lightactivedistance : num 6.06 4.71 3.91 2.83 5.04 ...
## $ sedentaryactivedistance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ veryactiveminutes : int 25 21 30 29 36 38 42 50 28 19 ...
## $ fairlyactiveminutes : int 13 19 11 34 10 20 16 31 12 8 ...
## $ lightlyactiveminutes : int 328 217 181 209 221 164 233 264 205 211 ...
## $ sedentaryminutes : int 728 776 1218 726 773 539 1149 775 818 838 ...
## $ calories : int 1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
str(Daily_Sleep)
## 'data.frame': 410 obs. of 5 variables:
## $ id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ activitydate : Date, format: "2016-04-12" "2016-04-13" ...
## $ totalsleeprecords : int 1 2 1 2 1 1 1 1 1 1 ...
## $ totalminutesasleep: int 327 384 412 340 700 304 360 325 361 430 ...
## $ totaltimeinbed : int 346 407 442 367 712 320 377 364 384 449 ...
Daily_Activity$activitydate <- as.Date(Daily_Activity$activitydate)
Daily_Sleep$activitydate <- as.Date(Daily_Sleep$activitydate)
Merge Datasets
Combine daily activity and sleep datasets into a single dataframe using ‘id’ and ‘activitydate’ as primary keys.
Daily_Activity_Sleep <- merge(Daily_Activity, Daily_Sleep, by=c ("id", "activitydate"))
Create New Variables
Daily_Activity_Sleep <- Daily_Activity_Sleep %>%
mutate(totalactiveminutes = veryactiveminutes + fairlyactiveminutes + lightlyactiveminutes + sedentaryminutes)
Further Data Transformation
• Separate date and time in the hourly steps dataset for time-based
analysis.
• Replace missing time values with a default (e.g., “00:00:00”).
Hourly_Steps <- Hourly_Steps %>%
separate(activityhour, into = c("date", "time"), sep = " ") %>%
mutate(date = ymd(date))
Hourly_Steps <- Hourly_Steps %>%
mutate(time = if_else(is.na(time), "00:00:00", time))
Summarizing the Data
Determine min, max, and average values for the variables of Daily_Activity_Sleep data frame.
Daily_Activity_Sleep %>%
select(totalsteps, totaldistance, veryactivedistance, moderatelyactivedistance, lightactivedistance, sedentaryactivedistance, calories, totalminutesasleep, totalactiveminutes) %>%
summary()
## totalsteps totaldistance veryactivedistance moderatelyactivedistance
## Min. : 17 Min. : 0.010 Min. : 0.000 Min. :0.0000
## 1st Qu.: 5189 1st Qu.: 3.592 1st Qu.: 0.000 1st Qu.:0.0000
## Median : 8913 Median : 6.270 Median : 0.570 Median :0.4200
## Mean : 8515 Mean : 6.012 Mean : 1.446 Mean :0.7439
## 3rd Qu.:11370 3rd Qu.: 8.005 3rd Qu.: 2.360 3rd Qu.:1.0375
## Max. :22770 Max. :17.540 Max. :12.540 Max. :6.4800
## lightactivedistance sedentaryactivedistance calories totalminutesasleep
## Min. :0.010 Min. :0.0000000 Min. : 257 Min. : 58.0
## 1st Qu.:2.540 1st Qu.:0.0000000 1st Qu.:1841 1st Qu.:361.0
## Median :3.665 Median :0.0000000 Median :2207 Median :432.5
## Mean :3.791 Mean :0.0009268 Mean :2389 Mean :419.2
## 3rd Qu.:4.918 3rd Qu.:0.0000000 3rd Qu.:2920 3rd Qu.:490.0
## Max. :9.480 Max. :0.1100000 Max. :4900 Max. :796.0
## totalactiveminutes
## Min. : 2.0
## 1st Qu.: 906.2
## Median : 983.0
## Mean : 971.6
## 3rd Qu.:1042.0
## Max. :1398.0
DT::datatable(
Daily_Activity_Sleep,
options = list(scrollX = TRUE),
caption = "Dataset 1 : Merged Daily Activity and Sleep Data"
)
DT::datatable(
Hourly_Steps,
caption = "Table 2 : Hourly Steps (average)"
)
• Calculate average daily steps for each user.
• Categorize users into ‘very active’, ‘fairly active’, ‘lightly
active’, or ‘sedentary’ based on their average steps.
Daily_Average <- Daily_Activity_Sleep %>%
group_by(id) %>%
summarise (mean_daily_steps = mean(totalsteps), mean_daily_calories = mean(calories), mean_daily_sleep = mean(totalminutesasleep))
Daily_Average <- Daily_Average %>%
mutate(
user_type = case_when(
mean_daily_steps > 10000 ~ "very active",
mean_daily_steps > 7500 ~ "fairly active",
mean_daily_steps > 5000 ~ "lightly active",
TRUE ~ "sedentary"
)
)
• Calculate the percentage of users in each activity category.
user_type_percent <- Daily_Average %>%
group_by(user_type) %>%
summarise(user_count = n_distinct(id)) %>%
mutate(percentage = paste0(round(100 * user_count / sum(user_count), 2), " %"))
DT::datatable(
head(user_type_percent),
caption = "Table 1 : User Type with Percentage",
width = "60%"
)
Hourly_Steps_summary <- Hourly_Steps %>%
group_by(time) %>%
summarize(average_steps = mean(steptotal))
htmltools::div(
style = "text-align:center;",
DT::datatable(
Hourly_Steps_summary,
caption = "Table 2 : Hourly Steps (average)",
width = "60%"
) %>%
DT::formatRound('average_steps', 2) %>%
DT::formatStyle(
columns = 'average_steps',
'text-align' = 'center'
)
)
Daily steps vs daily sleep Correlation Value
round(cor(Daily_Activity_Sleep$totalsteps, Daily_Activity_Sleep$totalminutesasleep, use = "complete.obs"), 2)
## [1] -0.19
Daily steps vs calories Correlation Value
round(cor(Daily_Activity_Sleep$totalsteps, Daily_Activity_Sleep$calories, use = "complete.obs"), 2)
## [1] 0.41
Daily steps vs calories Correlation Value
round(cor(Daily_Activity_Sleep$totalsteps, Daily_Activity_Sleep$calories, use = "complete.obs"), 2)
## [1] 0.41
Total distance vs calories Correlation Value
round(cor(Daily_Activity_Sleep$totaldistance, Daily_Activity_Sleep$calories, use = "complete.obs"), 2)
## [1] 0.52
Daily sleep vs Calories sleep Correlation Value
round(cor(Daily_Activity_Sleep$totalminutesasleep, Daily_Activity_Sleep$calories, use = "complete.obs"), 2)
## [1] -0.03
1. Target “Fairly Active” and “Lightly Active”
Users
• Insight: The majority of users are “fairly active” or
“lightly active,” with fewer in the “very active” or “sedentary”
categories.
• Action:
• Develop personalized challenges, motivational messages, or rewards
within the Bellabeat app to encourage these users to increase their
daily step count and transition to higher activity categories.
• Use push notifications or in-app reminders to nudge lightly active
users during times of low activity.
2. Leverage Hourly Activity Patterns
• Insight: Step counts are lowest during late night and
early morning, peaking in the late afternoon and evening.
• Action:
• Schedule app notifications, wellness tips, or mini-challenges during
mid-morning or early afternoon to boost activity during low-movement
periods.
• Consider offering guided walks or stretch breaks through the app
during these times.
3. Capitalize on Step-Calories Relationship
• Insight: There is a moderate positive correlation
between steps and calories burned.
• Action:
• Emphasize calorie-burn tracking and goal-setting features in marketing
and within the app.
• Provide users with feedback on how increasing their daily steps can
help them achieve calorie or weight-management goals.
4. Enhance Sleep Recommendations
• Insight: There is only a weak negative correlation
between steps and sleep duration.
• Action:
• When providing sleep advice, incorporate additional factors beyond
daily activity (e.g., stress, hydration, bedtime routines).
• Use Bellabeat’s holistic wellness features to offer more comprehensive
sleep health guidance.
5. Acknowledge Data Limitations
• Insight: The dataset is small, not demographically
diverse, and a few years old.
• Action:
• Supplement these findings with ongoing data collection from
Bellabeat’s own user base for more representative and up-to-date
insights.
• Use these initial findings as a starting point for further research
and product development.
6. Marketing and Engagement Strategies
• Action:
• Highlight the app’s ability to track and improve activity and calorie
burn in marketing campaigns.
• Share user success stories and testimonials to motivate community
engagement.
• Consider partnerships or challenges that encourage users to be active
during typically low-activity hours.
Working on the Bellabeat case study was a valuable experience that deepened both my technical and business analytics skills. Here’s what I learned:
1. The Power of Structured Analysis
I experienced firsthand the importance of following a systematic data
analysis process-Ask, Prepare, Process, Analyze, Share, and Act. This
framework kept my work organized and ensured that every step, from
defining business questions to delivering actionable recommendations,
was purposeful and aligned with stakeholder needs.
2. Data Cleaning is Critical
Real-world data is rarely perfect. I encountered issues like duplicates,
missing values, inconsistent formats, and mismatched data types.
Addressing these challenges taught me that thorough data cleaning is
foundational for any reliable analysis and that attention to detail at
this stage saves time and prevents errors later.
3. Transforming Data into Insights
I learned how to merge and transform multiple datasets to create a
unified view of user activity, sleep, and steps. Summarizing and
visualizing this data-such as classifying users by activity level or
analyzing hourly step patterns-showed me how raw numbers can be turned
into business-relevant insights.
4. The Value of Visualization
Creating clear, compelling visuals (like pie charts for user types and
bar charts for hourly activity) made it easier to communicate findings
to non-technical stakeholders. I saw how effective visualizations can
drive understanding and support decision-making.
5. Understanding Business Impact
Beyond technical analysis, I learned to connect insights to business
actions. For example, identifying when users are least active led to
recommendations for targeted notifications, and understanding user
activity levels informed ideas for product engagement strategies.
6. Recognizing Data Limitations
This project highlighted the importance of critically evaluating data
sources. The Fitbit dataset’s small sample size and lack of demographic
diversity reminded me that all analyses have limitations, and that
transparency about these is essential for responsible data science.
7. Portfolio and Communication Skills
Documenting my process in detail not only helped me reflect on my work
but also created a portfolio-ready case study I can share with potential
employers. It reinforced the importance of clear communication-both in
code and in business reporting.
In summary:
This case study strengthened my technical skills in R and data
wrangling, my ability to extract actionable insights from complex data,
and my understanding of how analytics can directly support business
goals in the health tech industry. Most importantly, it showed me how
data can be used to empower better health outcomes for real people-a
mission I’m excited to pursue in my future career.