CYCLISTIC BIKE-SHARE CASE STUDY
The following analysis is originally based on the case study “‘Sophisticated, Clear, and Polished’: Divvy and Data Visualization” written by Kevin Hartman (found here: https://artscience.blog/home/divvy-dataviz-case-study). We will be using the Divvy dataset for the case study, as part of the Google Data Analytics Program. The data has been made publicly available by Motivate International Inc. under this license. Due to data-privacy, any personally-identifiable information has been removed/encrypted.
Scenario
You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.
Business Task
How do annual members and casual riders use Cyclistic bikes differently?
Prepare/Process Data
Cyclistic's historical trip data (May 2020-April 2021) was provided to analyze and identify trends (internal/first-party data provided as .csv spreadsheets)
This data includes de-identified User IDs, user type, bike type, and start & end details (times, positions, station names & IDs).
Original files are stored in a separate directory and copies are made of each dataset, in case the originals need to be referenced.
Given the large sizes of the datasets, I will be using R via RStudio to prepare, process, and analyze.
#==========================================================================
# STEP-1: Installation of required packages & change of directory for ease
#==========================================================================
library(tidyverse) #helps wrangle data
library(lubridate) #helps wrangle data attributes
library(ggplot2) #helps to visualize data
getwd() #displays your working directory
setwd("C:/Users/TEMP/Downloads/GDA/Cyclistic Bike-Share") #sets your working directory to simplify calls to data
#setwd() should be used in desktop version of R
#==========================================================================
# STEP-2: Import data into R
#==========================================================================
# read_csv() imports data from .csv files
m5_2020 <- read_csv("202005-divvy-tripdata.csv")
m6_2020 <- read_csv("202006-divvy-tripdata.csv")
m7_2020 <- read_csv("202007-divvy-tripdata.csv")
m8_2020 <- read_csv("202008-divvy-tripdata.csv")
m9_2020 <- read_csv("202009-divvy-tripdata.csv")
m10_2020 <- read_csv("202010-divvy-tripdata.csv")
m11_2020 <- read_csv("202011-divvy-tripdata.csv")
m12_2020 <- read_csv("202012-divvy-tripdata.csv")
m1_2021 <- read_csv("202101-divvy-tripdata.csv")
m2_2021 <- read_csv("202102-divvy-tripdata.csv")
m3_2021 <- read_csv("202103-divvy-tripdata.csv")
m4_2021 <- read_csv("202104-divvy-tripdata.csv")
Step 3: Wrangle Data and Combine into a Single DataFrame
Compare column names and data types and consolidate/make consistent
column names and data types are shown upon import of each .csv file | or use str() function
Drop the following columns:
start_station_name
start_station_id
end_station_name
end_station_id
Why? Numerous rows show incomplete entries and similar identifying information can be gleaned from the start_lat, start_lng, end_lat, end_lng column entries
m4_2021 <- m4_2021 %>% select(-c(start_station_name, start_station_id, end_station_name, end_station_id))
m3_2021 <- m3_2021 %>% select(-c(start_station_name, start_station_id, end_station_name, end_station_id))
m2_2021 <- m2_2021 %>% select(-c(start_station_name, start_station_id, end_station_name, end_station_id))
m1_2021 <- m1_2021 %>% select(-c(start_station_name, start_station_id, end_station_name, end_station_id))
m12_2020 <- m12_2020 %>% select(-c(start_station_name, start_station_id, end_station_name, end_station_id))
m11_2020 <- m11_2020 %>% select(-c(start_station_name, start_station_id, end_station_name, end_station_id))
m10_2020 <- m10_2020 %>% select(-c(start_station_name, start_station_id, end_station_name, end_station_id))
m9_2020 <- m9_2020 %>% select(-c(start_station_name, start_station_id, end_station_name, end_station_id))
m8_2020 <- m8_2020 %>% select(-c(start_station_name, start_station_id, end_station_name, end_station_id))
m7_2020 <- m7_2020 %>% select(-c(start_station_name, start_station_id, end_station_name, end_station_id))
m6_2020 <- m6_2020 %>% select(-c(start_station_name, start_station_id, end_station_name, end_station_id))
m5_2020 <- m5_2020 %>% select(-c(start_station_name, start_station_id, end_station_name, end_station_id))
# Combine dataframes
all_trips <- bind_rows(m5_2020, m6_2020, m7_2020, m8_2020, m9_2020, m10_2020, m11_2020, m12_2020, m1_2021, m2_2021, m3_2021, m4_2021)
Step 4: Further Clean Up and Add Data to Prepare for Analysis
Check to make sure the proper number of observations are present
Add columns that list the date, month, day, day of the week, and year of each ride (this will allow us to aggregate data beyond the ride level)
Add a "ride_length" calculation to all_trips (in seconds)
#==========================================================================
# Check unique output values generated
#==========================================================================
# table(all_trips$member_casual) #results in either "member" or "casual"
# table(all_trips$rideable_type) #results in either "classic_bike", "docked_bike", or "electric_bike"
#==========================================================================
# Add data
#==========================================================================
all_trips$date <- as.Date(all_trips$started_at)
all_trips$month <- format(as.Date(all_trips$date), "%m")
all_trips$day <- format(as.Date(all_trips$date), "%d")
all_trips$year <- format(as.Date(all_trips$date), "%Y")
all_trips$day_of_week <- format(as.Date(all_trips$date), "%A")
all_trips$ride_length <- difftime(all_trips$ended_at, all_trips$started_at)
The rideable_type "docked_bike" designates bikes that were taken out of circulation by Cyclistic to assess for quality control. There are also various trips where ride_length returns a negative duration. These entries need to be omitted from our dataframe. Doing so will bring the total rows from 3,742,202 to 1,243,579.
all_trips_v2 <- all_trips[!(all_trips$rideable_type == "docked_bike" | all_trips$ride_length<0),]
Analysis
INPUT
all_trips_v2 %>%
group_by(member_casual) %>%
summarise(number_of_rides = n()
,average_duration = mean(ride_length))
OUTPUT
## A tibble: 2 x 3
# member_casual number_of_rides average_duration
# <chr> <int> <dbl>
#1 casual 425600 1480.
#2 member 817979 839.
INPUT
all_trips_v2 %>%
group_by(member_casual, rideable_type) %>%
summarise(number_of_rides = n())
OUTPUT
#`summarise()` has grouped output by 'member_casual'. You can override using the `.groups` argument.
## A tibble: 4 x 3
## Groups: member_casual [2]
# member_casual rideable_type number_of_rides
# <chr> <chr> <int>
#1 casual classic_bike 141576
#2 casual electric_bike 284024
#3 member classic_bike 392911
#4 member electric_bike 425068
Key Findings
Users with an annual membership complete more rides than casual riders
Among casual riders, Fridays, Saturdays, and Sundays are the most popular usage days
Casual riders, on average, spend about 76% more time on their rides than users with an annual membership
About 67% of total bike rides by casual riders were via electric bikes, compared to about 52% for total bike rides by users with an annual membership
Recommendations
Implement a new annual membership tier at a lower price point where a total number of rides are allotted for a given time period (e.g. week, month), versus current annual membership structure (unlimited number of rides, 45-min. limit per ride), and given that on average, casual riders spend more time on their rides than current members, but on average only 1480 sec.(24.67 min.)
Implement a limited time promotion for annual membership that loosens limits on Friday, Saturday, Sunday rides, given those are the most popular days casual riders use Cyclistic bikes
Add more electric bikes to inventory given casual riders prefer them over classic bikes
Other Considerations for Further Exploration
Collect data on available inventory at docking station user is at at point of rental to analyze impact on resulting bike type choice
Collect data on full route of a bike during rental, not just start and end locations, to analyze if certain areas/geospatial factors constitute certain bike types (e.g. terrain)
Include data on whether a casual rider uses a single-ride pass or full-day pass, to analyze how pricing may impact usage
View this case study on Kaggle