Keywords: Transportation Planning, Rideshare Forecast, Spatial and
Serial Autocorrelation, Nested Tibble, OLS Regression, time lag,
gganimate, purrr
Introduction
Bikeshare programs have become an integral part of urban
transportation system, providing a sustainable and convenient
alternative to traditional commuting methods. Citi Bike is one of the
earliest bicycle sharing system created by Lyft in 2008 at NYC. But in
recent years, this program also extends to Jersey City across the Hudson
River in order to cater to the dynamic mobility needs of residents who
had to commute to Manhattan for work. As the popularity of Citi Bike
continues to soar, there arises a pressing need for effective management
strategies to ensure the availability of bikes at each station in Jersey
City.
Bikesharing (and ridesharing) inherently involves the challenge of
maintaining a delicate equilibrium in bike distribution, ie.
“re-balancing”. This issue arises from the spatial and temporal
variations in user demand. If certain stations experience a surge in
demand during morning rush hours while others witness a decline, this
could result in station clusters with depleted or surplus bike stocks,
thereby compromising user experience. Companies like Uber & Lyft
generate and analyze tremendous amounts of data to incentivize bike
(ride) share use, solve routing problems, calculate pricing, and
obviously forecast demands, but they are not very transparent about the
algorithm they use. Therefore, the goal of this report is keep a simple,
but open source version of bikeshare demand prediction, by harnessing of
power of machine learning, particularly the Ordinary Least Square (OLS)
regression to rapidly predict the demand for bikes across various
stations in Jersey City. We will perform analysis on the historical data
on bike usage at different stations and different time as well as under
different weather conditions. The forecast allows the Citi Bike program
to proactively address imbalances, strategically re-positioning bikes in
anticipation of peak usage time or special events.
The subsequent sections on this report is organized as follows.
Firstly, we load in our datasets for a quick examinations and perform
some necessary feature engineering. Secondly, we conduct exploratory
analysis on the temporal and spatial aspects of bikeshare data. We will
also look at some weather data in this process. Thirdly, we create a
space-time panel where each row is a unique instance of time-space
combination of rides and visualize the correlation between space, time,
weather, and number of rides. And finally, we make created four
different models, test their sensibility to errors, and conduct cross
validation.
The GitHub repository for this report is here.
Set Up
The data we use here is obtained from NYC OpenData
that contains the latest record of Citi Bike usage on a monthly base.
Note that while Jersey City is located in New Jersey, Citi Bike is
mainly operated in NYC and therefore its dataset is also operated by
NYC. We selected all bikeshare data in Jersey City from Aug 1 2023 to
Aug 31 2023 as our temporal frame. The dataset mainly includes locations
and names of stations of which the user started the ride as well as end
the ride. In addition, we also downloaded the census tract geometries of
the entire Hudson County via tidycensus, which includes Jersey City.
nycbike <- read.csv(url("https://raw.githubusercontent.com/emilyzhou112/MUSA5080-NJ-BikeShare/main/JC-202308-citibike-tripdata.csv"))
njCensus <- get_acs(geography = "tract",
variables = c("B01003_001", "B19013_001",
"B02001_002", "B08013_001",
"B08012_001", "B08301_001",
"B08301_010", "B01002_001"),
year = 2021,
state = "NJ",
geometry = TRUE,
county=c("Hudson"),
output = "wide") %>%
rename(Total_Pop = B01003_001E,
Med_Inc = B19013_001E,
Med_Age = B01002_001E,
White_Pop = B02001_002E,
Travel_Time = B08013_001E,
Num_Commuters = B08012_001E,
Means_of_Transport = B08301_001E,
Total_Public_Trans = B08301_010E) %>%
select(Total_Pop, Med_Inc, White_Pop, Travel_Time,
Means_of_Transport, Total_Public_Trans,
Med_Age,
GEOID, geometry) %>%
mutate(Percent_White = White_Pop / Total_Pop,
Mean_Commute_Time = Travel_Time / Total_Public_Trans,
Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport)
jcTracts <- njCensus %>%
as.data.frame() %>%
distinct(GEOID, .keep_all = TRUE) %>%
select(GEOID, geometry) %>%
st_sf()
Since our raw bikeshare data is a non-spatial layer, despite
containing spatial attributes, our next step is to make it spatial by
joining it to the census geometry. We are only interested in the start
location of each ride because we want to predict the capacity of each
station when the user need a bike to start the ride. This requires us to
filter out rows with missing latitude and longitude. We also found that
some stations have erroneous longitude and latitude coding that lead to
these points being located in the Hudson River. These points also need
to be removed.
bike_census <- st_join(nycbike %>%
filter(is.na(start_lat) == FALSE &
is.na(start_lng) == FALSE &
is.na(end_lat) == FALSE &
is.na(end_lng) == FALSE) %>%
st_as_sf(., coords = c("start_lng", "start_lat"), crs = 4326),
jcTracts %>% st_transform(crs=4326),
join=st_intersects, left = TRUE) %>%
rename(Origin.Tract = GEOID) %>%
mutate(start_lng = unlist(map(geometry, 1)),
start_lat = unlist(map(geometry, 2)))%>%
filter(!(ride_id %in% c("BEC9CC4D1007C753", "86087F431785EDE7", "A1AAB92631439883", "71938DA085242DCC", "4203C77668C47C75", "C92410CA2AEF3947", "6E02CBF36C90F244", "0259885253FD238E", "B39136B37AFB5FE9", "1EC9376D1559C51C", "CDDF16D3A37BE370"))) %>%
as.data.frame()
The following visualization shows all the steps we did above.
ggplot()+
geom_sf(data = jcTracts %>%
st_transform(crs=4326), fill="white")+
geom_point(data = bike_census, aes(x=start_lng, y = start_lat),
color = "#6E0E0A", alpha = 1, size = 0.3) +
ylim(min(bike_census$start_lat), max(bike_census$start_lat))+
xlim(min(bike_census$start_lng), max(bike_census$start_lng)) +
labs(title = "Location of Rides Starting Points in Jersey City") +
theme(axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
)
We will perform some basic feature engineering using R’s powerful
lubridate
package to extract some temporal information for
each ride. This includes calculating the 60min and 15min interval for
the starting time of each ride, and categorizing these intervals into by
the time of day into “Overnight”, “AM Rush”, “Mid-Day”, and “PM Rush”
and by day of week into weekday and weekend.
bike_census <- bike_census %>%
mutate(interval60 = floor_date(ymd_hms(started_at), unit = "hour"),
interval15 = floor_date(ymd_hms(started_at), unit = "15 mins"),
week = week(interval60),
dotw = wday(interval60, label=TRUE)) %>%
mutate(time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
mutate(hour = hour(started_at),
weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"))
Interested in seeing the commuting patterns to get a sense of rider’s
daily behavior, we created line strings that connect between the start
and end locations of each ride. For the simplicity of visualization, we
only visualize 200 rides. We may see that most of the rides occur within
Jersey City, but there are also people who commute using Citi Bike from
Jersey City to Manhattan.
lines <- st_sfc(
lapply(1:nrow(bike_census), function(i) {
st_linestring(matrix(c(bike_census[i, "start_lng"], bike_census[i, "start_lat"],
bike_census[i, "end_lng"], bike_census[i, "end_lat"]), ncol = 2, byrow = TRUE))
}))
lines_sf <- st_sf(geometry = lines)
lines_sf <- st_set_crs(lines_sf, 4326)
Route <- lines_sf %>% head(200)
mapview(Route, zcol = NULL, color = "#6E0E0A", linewidth = 0.001, alpha = 0.3) # only show 200
Exploratory Analysis
In this section, we delve into some further exploratory analysis of
our data set.
Temporal Analysis
Starting with temporal analysis, we first visualized the bikeshare
trips per hour in Jersey City in August 2023. It is clear that there’s
some form of circularity in that there are both peak hours and non-peak
hours during a single day. Also, we may see that five consecutive higher
peaks are usually followed by two less higher peaks. These correspond to
high bikeshare need during rush hours and low bikeshare needs overnight,
in early mornings, and over the weekend
ggplot(bike_census %>%
group_by(interval60) %>%
tally())+
geom_line(aes(x = interval60, y = n), color= "#6E0E0A")+
labs(title="Bike Share Trips Per Hour in Jersey City, Aug 2023",
x="Date",
y="Number of trips") +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=8),
axis.text.y=element_text(size=8),
axis.title=element_text(size=10),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, linewidth =0.8))
Then, we visualized the distribution of trips by stations at each
hour in August. We may see that the distribution of rides is highly
right skewed, with most stations having zero or a few number of trips
each hour and a couple of stations having much more rides, up to 40 or
even 50 rides at a particular hour on a particular day in August
2023.
ggplot(bike_census %>%
group_by(interval60, start_station_name) %>%
tally())+
geom_histogram(aes(n), binwidth = 5, fill="#6E0E0A")+
labs(title="Bike Share Trips Per Hour by Station in Jersey City, Aug 2023",
x="Trip Counts",
y="Number of Stations") +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=8),
axis.text.y=element_text(size=8),
axis.title=element_text(size=10),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, linewidth =0.8))
We continue to visualize the distribution of mean number of trips at
each station during different times of the day. We recognize that during
PM rush, the average number of trips has been the highest throughout,
with a couple of stations’ average trip exceed 10, implying a relatively
high demand for a single station. This is followed by ride demand in mid
day, when some stations can have more than 5 rides. It is also
surprising to notice that the demand is pretty high overnight, with a
few stations having more than 5 rides.
bike_census %>%
group_by(interval60, start_station_name, time_of_day) %>%
tally()%>%
group_by(start_station_name, time_of_day)%>%
summarize(mean_trips = mean(n))%>%
ggplot()+
geom_histogram(aes(mean_trips, fill=time_of_day), binwidth = 1)+
scale_fill_manual(values = palette4, name="Time of Day") +
labs(title="Mean Number of Hourly Trips Per Station",
x="Number of trips",
y="Frequency")+
facet_wrap(~time_of_day) +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=8),
axis.text.y=element_text(size=8),
axis.title=element_text(size=10),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8))
After that, we visualized the total number of trips in August
separated by day of the week. The two main patterns here are that
firstly, trip numbers vary by time of the day. Lowest trip counts occur
in early morning while highest trip count occur in late afternoon.
Secondly, Tuesday, Wednesday, and Thursday tend to observe higher number
of trips.
ggplot(bike_census %>% mutate(hour = hour(started_at)))+
geom_freqpoly(aes(hour, color = dotw), binwidth = 3, lwd = 0.8)+
scale_color_manual(values = c("#124E78", "#819FA1", "#F0F0C9", "#F2BB05", "#E58507", "#D74E09", "#6E0E0A"), name = "Day") +
labs(title="Bike Share Trips in Jersey City by Day of the Week, Aug 2023",
x="Hour",
y="Trip Counts") +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=8),
axis.text.y=element_text(size=8),
axis.title=element_text(size=10),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8))
Same type of visualize but this time separated between weekend and
weekdays. It is very obvious that we see more trips during weekdays and
less trips during weekends. Also, during weekends, number of trips start
to increase much later than during weekdays, suggesting the missing of
morning rush hour will significantly impact people’s use of bikeshare
services.
ggplot(bike_census)+
geom_freqpoly(aes(hour, color = weekend), binwidth = 3, lwd = 1.2)+
scale_color_manual(values=palette2) +
labs(color = "Time of Week") +
labs(title="Bike Share Trips in Jersey City - Weekend vs Weekday, Aug 2023",
x="Hour",
y="Trip Counts") +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=8),
axis.text.y=element_text(size=8),
axis.title=element_text(size=10),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8))
Building onto that, we group this dataset by station and time of day
to visualize the total number of trips at each station during a specific
period of time on weekday or weekend. To simplify our visualization, we
selected the top 50 occurrences in our dataset. Note that here, each bar
represent an instance of station-time of day-weekday/weekend
combination. We are able to see that PM rush hour indeed sees that
greatest bikeshare demand for a couple of stations in particular.
Weekend rides only appear three times in the top 50 list. Zooming into
the detail reveals to us that Grove St PATH station
observes a total of 1117 rides during PM rush on a weekday, 801 rides
overnight on a weekday in August 2023. Hoboken Terminal - River
St & Hudson Pl station observes a total of 864 rides during
PM rush on weekday. South Waterfront Walkway - Sinatra Dr &
1 St station observes 319 rides overnight on a weekend and 731
rides during PM rush on a weekday in August 2023.
to_plot <- bike_census%>% # only show 100 stations
group_by(start_station_id, start_lat, start_lng, weekend, time_of_day, start_station_name) %>%
tally() %>%
arrange(-n) %>%
head(50)
to_plot$ID <- seq_along(to_plot$start_station_id)
to_plot %>%
arrange(-n) %>%
head(50) %>%
ggplot(aes(x = reorder(ID, -n), n, fill = time_of_day, color = weekend)) +
scale_fill_manual(values = palette4, name="Time of Day") +
guides(color="none") +
geom_bar(stat = "identity", position="stack") +
scale_color_manual(values = c("transparent", "black")) +
labs(title="Top 50 Occurences of Rides By Time")+
ylab("Number of Rides") +
xlab("") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y=element_text(size=8),
axis.title=element_text(size=10),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8))
Spatial Analysis
In addition to examining our data through temporal lens, we made some
maps to show the actual location of stations with the greatest bikeshare
demand by time of day and day of the week. This proportional map
illustrates that stations receive the highest ride demand during PM rush
and overnight in weekdays and the lowest demand during AM rush. Stations
with the highest bikeshare demand are generally located along the Hudson
River. This could be attributed to two reasons: there are simply more
stations in this part of Jersey City or Citi Bike has strategically
built bike stations along the Hudson River to make daily commute easier
for those who live in Jersey City but need to work/study in
Manhattan.
ggplot()+
geom_sf(data = jcTracts %>%
st_transform(crs=4326), fill = "white")+
geom_point(data = to_plot,
aes(x=start_lng, y = start_lat, size=n), color = alpha("#6E0E0A", 0.7)) +
ylim(min(bike_census$start_lat), max(bike_census$start_lat))+
xlim(min(bike_census$start_lng), max(bike_census$start_lng)) +
scale_size(range = c(0.5, 8)) +
labs(size = "Number of Rides") +
facet_grid(weekend ~ time_of_day) +
labs(title="Locations of Stations with Greatest Bikeshare Demand by Time")+
theme(axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
)
We grouped all the starting stations together to check for spatial
autocorrelation between stations. We believe that stations with greater
demand are more likely to be located to other stations with greater
demand. In this case, we can use the demand of nearby stations at a
particular time to estimate the demand of our target station. We
selected 80 stations with the highest number of rides in August 2023 and
computed the Moran’s I with 999 random Monte Carlo simulation. The
results show that our observed Moran’s I is highly significant, implying
that there’s spatial autocorrelation between bikeshares at nearby
stations.
moran_data <- bike_census %>%
group_by(start_station_id, start_lat, start_lng, start_station_name) %>%
tally() %>%
arrange(-n) %>%
head(80) %>%
st_as_sf(., coords = c("start_lng", "start_lat"), crs = 4326)
coords <- st_coordinates(moran_data)
neighborList <- knn2nb(knearneigh(coords, 5))
spatialWeights <- nb2listw(neighborList, style="W")
moranTest <- moran.mc(moran_data$n,
spatialWeights, nsim = 999)
ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
geom_histogram(binwidth = 0.01, fill = "#124E78") +
geom_vline(aes(xintercept = moranTest$statistic), colour = "#6E0E0A" ,lwd=1) +
scale_x_continuous(limits = c(-0.5, 0.5)) +
labs(title = "Observed and Permuted Moran's I for Stations with Most Rides",
subtitle = "Observed Moran's I in Red",
x = "Moran's I",
y = "Count") +
theme_light() +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=8),
axis.text.y=element_text(size=8),
axis.title=element_text(size=10))
This map below shows how stations with higher bikeshare demand are
clustered. In particular, there are two clusters of stations, and within
each cluster, there are a few stations with particularly large number of
rides. These stations are generally surrounded by stations with slightly
less rides.
ggplot() +
geom_sf(data = jcTracts, fill = "white") +
geom_sf(data = moran_data,
aes(size = n, color = n)) +
ylim(min(bike_census$start_lat), max(bike_census$start_lat))+
xlim(min(bike_census$start_lng), max(bike_census$start_lng)) +
scale_size(
range = c(1, 6),
guide = guide_legend(
direction = "horizontal",
nrow = 1,
label.position = "bottom")) +
scale_color_gradientn(colours = hcl.colors(5, "RdBu",rev = TRUE, alpha = 0.9),
guide = guide_legend(direction = "horizontal", nrow = 1)) +
labs(title = "Stations with Most Rides",
subtitle = "Dot size proportional to rides") +
theme(legend.position = "bottom") +
theme(axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
)
This animated map below combines the spatial and temporal
characteristics of bike rides at 15min interval into a choropleth
map.
week33 <-
filter(bike_census , week == 33 & dotw == "Mon")
week33.panel <-
expand.grid(
interval15 = unique(week33$interval15),
Origin.Tract = unique(bike_census$Origin.Tract))
bike.animation.data <-
mutate(week33, Trip_Counter = 1) %>%
right_join(week33.panel) %>%
group_by(interval15, Origin.Tract) %>%
summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
ungroup() %>%
left_join(jcTracts, by=c("Origin.Tract" = "GEOID")) %>%
st_sf() %>%
mutate(Trips = case_when(Trip_Count == 0 ~ "0 trips",
Trip_Count > 0 & Trip_Count <= 3 ~ "1-3 trips",
Trip_Count > 3 & Trip_Count <= 6 ~ "4-6 trips",
Trip_Count > 6 & Trip_Count <= 10 ~ "7-10 trips",
Trip_Count > 10 ~ "11+ trips")) %>%
mutate(Trips = fct_relevel(Trips, "0 trips","1-3 trips","4-6 trips",
"7-10 trips","10+ trips"))
bikeshare_animation <-
ggplot() +
geom_sf(data = bike.animation.data, aes(fill = Trips)) +
scale_fill_manual(values = palette5) +
labs(title = "Bikeshare For One Day in Aug 2023",
subtitle = "15 minute intervals: {current_frame}") +
transition_manual(interval15) +
theme(axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
)
anim_save("animation.gif", animation = bikeshare_animation, end_pause = 4, height = 1000, width = 700, fps = 6, renderer = gifski_renderer())
Weather Analysis
On top of space and time, weather can be another significant factor
of in predicting bikeshare demand. Generally, we believe bikeshare
demand will decrease if it’s on a rainy, windy, or a very hot day in our
case. With that in mind, we downloaded the weather data using the
riem_measures
function for Newark airport between Aug 1,
2023 and Sept 1, 2023. Note that the Newark Airport station sufficiently
provides temporal weather for all of Jersey City. The weather.Panel is
generated to summarize temperature, precipitation, and wind speed for
every hour.
weather.Panel <-
riem_measures(station = "EWR", date_start = "2023-08-01", date_end = "2023-09-01") %>%
dplyr::select(valid, tmpf, p01i, sknt)%>%
replace(is.na(.), 0) %>%
mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
mutate(week = week(interval60),
dotw = wday(interval60, label=TRUE)) %>%
group_by(interval60) %>%
summarize(Temperature = max(tmpf),
Precipitation = sum(p01i),
Wind_Speed = max(sknt)) %>%
mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))
For the temperature, precipitation, and wind data we collected during
this month. We may see that there are a couple of rainy days in August.
Wind speed is pretty consistent throughout with a couple of really windy
days. Temperature fluctuates on a daily basis with higher temperature
during the day and cooler temperature at night.
grid.arrange(
ggplot(weather.Panel, aes(interval60,Precipitation)) + geom_line(color="#6E0E0A") +
labs(title="Percipitation", x="Hour", y="Perecipitation") +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=8),
axis.text.y=element_text(size=8),
axis.title=element_text(size=10),
panel.background = element_blank()),
ggplot(weather.Panel, aes(interval60,Wind_Speed)) + geom_line(color="#D74E09") +
labs(title="Wind Speed", x="Hour", y="Wind Speed") +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=8),
axis.text.y=element_text(size=8),
axis.title=element_text(size=10),
panel.background = element_blank()),
ggplot(weather.Panel, aes(interval60,Temperature)) + geom_line(color="#124E78") +
labs(title="Temperature", x="Hour", y="Temperature") +theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=8),
axis.text.y=element_text(size=8),
axis.title=element_text(size=10),
panel.background = element_blank()),
top="Weather Data ")
Space-Time Panel
In the following steps, we created a study panel where each instance
in the panel is a unique combination of space and time. In other words,
each row will represent the ride at a particular station during a
particular hour.
study.panel <-
expand.grid(interval60=unique(bike_census$interval60),
start_station_id = unique(bike_census$start_station_id)) %>%
left_join(., bike_census %>%
select(start_station_id, start_station_name, Origin.Tract, start_lng, start_lat)%>%
distinct() %>%
group_by(start_station_id) %>%
slice(1))
We need to add some more information to this panel. This includes
counting the number of rides at this station at this particular hour,
adding weather information, bringing in census data, and calculating
time and day of week using the lubridate
package.
bike.panel <-
bike_census %>%
mutate(Trip_Counter = 1) %>%
right_join(study.panel) %>%
group_by(interval60, start_station_id, start_station_name, Origin.Tract, start_lng, start_lat) %>%
summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
left_join(weather.Panel) %>%
ungroup() %>%
filter(is.na(start_station_id) == FALSE) %>%
mutate(week = week(interval60),
dotw = wday(interval60, label = TRUE)) %>%
filter(is.na(Origin.Tract) == FALSE) %>%
left_join(., njCensus %>%
as.data.frame() %>%
select(-geometry), by = c("Origin.Tract" = "GEOID"))
Weather Correlation
Now, we could perform more exploratory analysis on our panel data to
prepare for our regression model. First, we would like to see how
temperature will correlate with the number of bike rides in August. The
figure below shows a significant, positive, strong, and linear
relationship between temperature and mean trip count for all weeks in
August. This means that as the temperature increase, bikeshare demand
will increase as well.
bike.panel %>%
group_by(interval60) %>%
summarize(Trip_Count = mean(Trip_Count),
Temperature = first(Temperature)) %>%
mutate(week = week(interval60)) %>%
ggplot(aes(Temperature, Trip_Count)) +
geom_point(color = "#6E0E0A") + geom_smooth(method = "lm", se= FALSE, color="#124E78") +
facet_wrap(~week, ncol=8) +
labs(title="Trip Count As a Fuction of Temperature by Week",
x="Temperature", y="Mean Trip Count") +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y=element_text(size=8),
axis.title=element_text(size=10),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8))
Next, we would like to see how wind speed will correlate with the
number of bike rides in August. We see that a linear and positive
relationship between these two variables, but the strength of such
relationship varies between different weeks. This generally means that
we see more bikeshare demand during windy days.
bike.panel %>%
group_by(interval60) %>%
summarize(Trip_Count = mean(Trip_Count),
Wind = first(Wind_Speed)) %>%
mutate(week = week(interval60)) %>%
ggplot(aes(Wind, Trip_Count)) +
geom_point(color = "#6E0E0A") + geom_smooth(method = "lm", se= FALSE, color="#124E78") +
facet_wrap(~week, ncol=8) +
labs(title="Trip Count As a Fuction of Wind by Week",
x="Wind Speed", y="Mean Trip Count") +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y=element_text(size=8),
axis.title=element_text(size=10),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8)) +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y=element_text(size=8),
axis.title=element_text(size=10),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8))
Serial Correlation
Should the number of trips exhibit serial (temporal) correlation,
then the time lag features will lead to better predictions. The logic
behind is that the number of trips at a particular hour will very much
correlate with the number of trips immediately before or after that
hour. If we know the number of trips at a particular hour, then it would
be easier for us to estimate the number of rides around that time. To
achieve that, we calculated 6 different forms of lag hours.
bike.panel <-
bike.panel %>%
arrange(start_station_id, interval60) %>%
mutate(lagHour = dplyr::lag(Trip_Count,1),
lag2Hours = dplyr::lag(Trip_Count,2),
lag3Hours = dplyr::lag(Trip_Count,3),
lag4Hours = dplyr::lag(Trip_Count,4),
lag12Hours = dplyr::lag(Trip_Count,12),
lag1day = dplyr::lag(Trip_Count,24)) %>%
mutate(day = yday(interval60))
The relationship between these lag hours and number of trips are
visualized. We see a strong, linear, and negative relationship between
lag12hours and number of trips 12 hours before. This means for example,
the number of rides we see at 5pm on a particular day is negatively
correlated with the number of rides at 5am that day. In contrast, we see
a strong, linear, and positive relationship between laghour and lag1day.
This means that the number of rides at a particular hour is highly
correlated with the number of rides in the immediate next hour and at
the same time the following day. Other than that, the strength of the
correlation decreases for lag2hour and lag3hour, meaning that the number
of rides at 4pm on a particular day, is more correlated with the number
of rides at 5pm, rather than at 6 or 7 pm.
as.data.frame(bike.panel) %>%
group_by(interval60) %>%
summarise_at(vars(starts_with("lag"), "Trip_Count"), mean, na.rm = TRUE) %>%
gather(Variable, Value, -interval60, -Trip_Count) %>%
ggplot(aes(x = Value , y = Trip_Count)) +
geom_point(size = 1, color = "#6E0E0A") + geom_smooth(method = "lm", se= FALSE, color="#124E78") +
facet_wrap(~Variable, ncol = 6)+
labs(x = "Variable", y = "Correlation", title = "Correlation Between Serial Lag Trips and Trip Count") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y=element_text(size=8),
axis.title=element_text(size=10),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8))
Spatial Correlation
We’ve seen before that spatial autocorrelation exists here in that
the number of trips at a particular station during a particular time is
correlated with the number of trips at nearby stations. This means that
if we know the number of trips at a station, we may be able to estimate
the number of trips at the station next to it. Do achieve that, we
created a new dataframe called lag_result
and wrote a
nested loop to loop through each hour in August 2023. For each hour, we
then calculate the nearest 5 neighbor of each station and the average of
rides for that 5 stations. The results were joined back to our panel
dataset.
lag_result <- data.frame()
for (d in 213:243) {
for (h in 0:23) {
spacelag <- bike.panel %>%
mutate(hour = hour(interval60)) %>%
filter(day == d & hour == h) %>%
st_as_sf(., coords = c("start_lng", "start_lat"), crs = 4326)
coords.test_lag <- st_coordinates(spacelag)
neighborList.test_lag <- knn2nb(knearneigh(coords.test_lag, 5))
spatialWeights.test_lag <- nb2listw(neighborList.test_lag, style="W")
spacelag <- spacelag %>% mutate(lagTrip = lag.listw(spatialWeights.test_lag, Trip_Count))
lag_result <- rbind(lag_result, spacelag)
}
}
bike.panel <- lag_result %>%
mutate(start_lng = unlist(map(geometry, 1)),
start_lat = unlist(map(geometry, 2))) %>%
st_drop_geometry()
The figure below shows that for all weeks in August, there’s a
strong, linear, and positive relationship between the number of trips
and the number of trips in nearby stations. We will therefore include
this variable in our following regression analysis.
bike.panel %>%
group_by(interval60) %>%
summarize(Trip_Count = mean(Trip_Count),
Lag_Count = mean(lagTrip)) %>%
mutate(week = week(interval60)) %>%
ggplot(aes(Lag_Count, Trip_Count)) +
geom_point(size = 1, color = "#6E0E0A") + geom_smooth(method = "lm", se= FALSE, color="#124E78") +
facet_wrap(~week, ncol=8) +
labs(title="Correlation Between Spatial Lag Trips and Trip Count",
x="Lag Trip Count", y="Mean Trip Count") +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y=element_text(size=8),
axis.title=element_text(size=10),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8))
OLS Regression
We split our 5 week data into training set, which includes 3 weeks
and testing set, which include 2 weeks. After that, four different
linear regression are estimaed on training data, each with different
fixed effects.
reg1
focuses on just time, including hour fixed
effects, day of the week, Wind Speed and Temperature.
reg2
focuses on just space, including station and
weather conditions.
reg3
focuses on the combined effect of space and
time
reg4
focuses adds the time and space lag features
bike.Train <- filter(bike.panel, week <= 33)
bike.Test <- filter(bike.panel, week > 33)
# Time
reg1 <-
lm(Trip_Count ~ hour(interval60) + dotw + Temperature + Wind_Speed, data=bike.Train)
# Space
reg2 <-
lm(Trip_Count ~ start_station_name + Temperature + Wind_Speed, data=bike.Train)
# Time and Space
reg3 <-
lm(Trip_Count ~ start_station_name + hour(interval60) + dotw + Temperature + Wind_Speed,
data=bike.Train)
# Time and Space and TimeLag and SpaceLag
reg4 <-
lm(Trip_Count ~ start_station_name + hour(interval60) + dotw + Temperature + Wind_Speed +
lagHour + lag2Hours + lag12Hours + lag1day + lagTrip,
data=bike.Train)
Making Predictions
The purrr family of functions will be used to loop through a set of
‘nested’ data frames to efficiently compare across different model
specifications. Before testing our data, we nested our data by week and
created a model function. The first set of code separate our testing
data into two tibbles. The small function is created that takes a
tibble, dat and a regression model, fit as its inputs, and outputs
predictions as pred. This function is used simply for prediction.
bike.Test.weekNest <-
bike.Test %>%
nest(-week)
model_pred <- function(dat, fit){
pred <- predict(fit, newdata = dat)}
The nested format allows one to loop through each model for each week
and mutate summary statistics. In the code block below, week_predictions
are calculated for each week in the testing set. The map function
applies the model_pred
function to each nested tibble. Take
the first line of the below mutate, for example. A new column, Time_FE,
includes predictions for reg1 - the time fixed effects model. The
predictions are created by mapping the function, model_pred, to each row
of data, parameterizing fit as the reg1 model.
week_predictions <-
bike.Test.weekNest %>%
mutate(ATime_FE = map(.x = data, fit = reg1, .f = model_pred),
BSpace_FE = map(.x = data, fit = reg2, .f = model_pred),
CTime_Space_FE = map(.x = data, fit = reg3, .f = model_pred),
DTime_Space_FE_Lags = map(.x = data, fit = reg4, .f = model_pred)) %>%
gather(Regression, Prediction, -data, -week) %>%
mutate(Observed = map(data, pull, Trip_Count),
Absolute_Error = map2(Observed, Prediction, ~ abs(.x - .y)),
MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))
Error Analysis
Comparing the mean absolute errors for our four models, we find the
error to be the higest for model 1, which only takes time effect into
account. Model 2 and 3 account for space and space + time respectively,
which lead to an increase in accuracy level. Adding space lag and time
lag, however, will greatly improve the accuracy of our mode. We also ran
the variance of inflation function in the background and checked to make
sure that there’s no colinerarity in model 4.
week_predictions %>%
dplyr::select(week, Regression, MAE) %>%
gather(Variable, MAE, -Regression, -week) %>%
ggplot(aes(week, MAE)) +
geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
scale_fill_manual(values = palette4) +
labs(title = "Mean Absolute Errors by Model Specification and Week") +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y=element_text(size=8),
axis.title=element_text(size=10),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8))
The figure below further demonstrates the accuracy of each model by
comparing predicted to observed number of trips. Model 4 clearly stands
out among the 4 models to have higher predicting power of bikeshare
demand.
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start_station_id = map(data, pull, start_station_id)) %>%
dplyr::select(interval60, start_station_id, Observed, Prediction, Regression) %>%
unnest() %>%
gather(Variable, Value, -Regression, -interval60, -start_station_id) %>%
group_by(Regression, Variable, interval60) %>%
summarize(Value = sum(Value)) %>%
ggplot(aes(interval60, Value, colour=Variable)) +
geom_line(size = 1.1) +
scale_color_manual(values=palette2)+
facet_wrap(~Regression, ncol=1) +
labs(title = "Predicted/Observed Bike Share Time Series", subtitle = "Jersey City: a test set of 2 weeks", x = "Hour", y= "Station Trips") +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y=element_text(size=8),
axis.title=element_text(size=10),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8))
We also need to check if our predictions generalize across space and
time. To do that, we mapped our mean absolute error of model 4 across
space. We can see that this model does not do a better job in predicting
number of rides at stations along the Hudson River, which are typically
stations with higher bikeshare demands during both weekend and weekdays.
Other than that, this model does a pretty good job in predicting
bikeshare demands at stations with lower number of trips in August.
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start_station_id = map(data, pull, start_station_id),
start_latitude = map(data, pull, start_lat),
start_longitude = map(data, pull, start_lng)) %>%
select(interval60, start_station_id, start_longitude, start_latitude, Observed, Prediction, Regression) %>%
unnest() %>%
filter(Regression == "DTime_Space_FE_Lags") %>%
group_by(start_station_id, start_longitude, start_latitude) %>%
summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
geom_sf(data = jcTracts, fill = "white")+
geom_point(aes(x = start_longitude, y = start_latitude, color = MAE),
fill = "transparent")+
scale_color_continuous(low = "#124E78", high = "#6E0E0A", name= "MAE")+
ylim(min(bike_census$start_lat), max(bike_census$start_lat))+
xlim(min(bike_census$start_lng), max(bike_census$start_lng)) +
labs(title="Mean Abs Error, Test Set, Model 4") +
theme(axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
)
Comparing across space and time, the figure below shows that model 4 has
been pretty consistent in predicting number of trips across time.
Stations that are predicted to have higher demand in the weekend will
continue to be predicted to have higher demand during the weekdays.
Errors carry throughout time. This implies that stations that
historically have higher demand have higher variability in the number of
rides and are more difficult to predict with the current predictors we
have. In fact, our current predictors only explains 50.12% of the
variability. What’s problematic about this is that if we continue to
have larger errors when predicting stations with higher bikeshare
demand, we will largely fail to meet the demand of bike users at these
stations.
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start_station_id = map(data, pull, start_station_id),
start_latitude = map(data, pull, start_lat),
start_longitude = map(data, pull, start_lng),
dotw = map(data, pull, dotw) ) %>%
select(interval60, start_station_id, start_longitude,
start_latitude, Observed, Prediction, Regression,
dotw) %>%
unnest() %>%
filter(Regression == "DTime_Space_FE_Lags")%>%
mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
group_by(start_station_id, weekend, time_of_day, start_longitude, start_latitude) %>%
summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
geom_sf(data = jcTracts, color = "grey", fill = "transparent")+
geom_point(aes(x = start_longitude, y = start_latitude, color = MAE),
fill = "transparent", size = 0.5)+
scale_color_continuous(low = "#124E78", high = "#6E0E0A", name= "MAE")+
ylim(min(bike_census$start_lat), max(bike_census$start_lat))+
xlim(min(bike_census$start_lng), max(bike_census$start_lng)) +
facet_grid(weekend~time_of_day)+
labs(title="Mean Absolute Errors by Time, Test Set, Model 4") +
theme(axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
)
Cross-Validation
Finally, we performed 100 cross-validation tests for all five weeks
of data using model 4. The MAE value is 0.44, indicating that despite
being our best solution so far, the model’s prediction error can still
be relatively large. It will also be good practice to cross validate
across socio-economic indicators to see if bike stations’ demand within
certain neighborhoods are disproportionately under/over predicted, but
our model current does not take into account any socioeconomic variables
as predictor variable and the scenario in Jersey City is that more
stations are built
fitControl <- trainControl(method = "cv", number = 100)
reg.cv <- train(Trip_Count ~ start_station_name + hour(interval60) + dotw + Temperature + Wind_Speed + lagHour + lag2Hours + lag12Hours + lag1day + lagTrip, data=bike.panel, method = "lm", trControl = fitControl, na.action = na.pass)
reg.cv$resample %>%
summarise(MAE = mean(reg.cv$resample[,3]),
sd(reg.cv$resample[,3])
) %>%
kbl(col.name=c('Mean Absolute Error','Standard Deviation of MAE')) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Mean Absolute Error
|
Standard Deviation of MAE
|
0.4484213
|
0.0311512
|
---
title: "Predicting Summer Bikeshare Demand in Jersey City"
author: "Emily Zhou"
date: "`r Sys.Date()`"
output:
  html_document:
    theme: simplex
    toc: yes
    toc_float: yes
    code_folding: hide
    code_download: yes
editor_options:
  markdown:
    wrap: sentence
---

Version 1.0 \| First Created Nov 4, 2023 \| Updated Nov 15, 2023

Keywords: Transportation Planning, Rideshare Forecast, Spatial and Serial Autocorrelation, Nested Tibble, OLS Regression, time lag, gganimate, purrr


```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```


# Introduction

Bikeshare programs have become an integral part of urban transportation system, providing a sustainable and convenient alternative to traditional commuting methods. Citi Bike is one of the earliest bicycle sharing system created by Lyft in 2008 at NYC. But in recent years, this program also extends to Jersey City across the Hudson River in order to cater to the dynamic mobility needs of residents who had to commute to Manhattan for work. As the popularity of Citi Bike continues to soar, there arises a pressing need for effective management strategies to ensure the availability of bikes at each station in Jersey City. 

Bikesharing (and ridesharing) inherently involves the challenge of maintaining a delicate equilibrium in bike distribution, ie. "re-balancing". This issue arises from the spatial and temporal variations in user demand. If certain stations experience a surge in demand during morning rush hours while others witness a decline, this could result in station clusters with depleted or surplus bike stocks, thereby compromising user experience. Companies like Uber & Lyft generate and analyze tremendous amounts of data to incentivize bike (ride) share use, solve routing problems, calculate pricing, and obviously forecast demands, but they are not very transparent about the algorithm they use. Therefore, the goal of this report is keep a simple, but open source version of bikeshare demand prediction, by harnessing of power of machine learning, particularly the Ordinary Least Square (OLS) regression to rapidly predict the demand for bikes across various stations in Jersey City. We will perform analysis on the historical data on bike usage at different stations and different time as well as under different weather conditions. The forecast allows the Citi Bike program to proactively address imbalances, strategically re-positioning bikes in anticipation of peak usage time or special events. 

The subsequent sections on this report is organized as follows. Firstly, we load in our datasets for a quick examinations and perform some necessary feature engineering. Secondly, we conduct exploratory analysis on the temporal and spatial aspects of bikeshare data. We will also look at some weather data in this process. Thirdly, we create a space-time panel where each row is a unique instance of time-space combination of rides and visualize the correlation between space, time, weather, and number of rides. And finally, we make created four different models, test their sensibility to errors, and conduct cross validation. 

The GitHub repository for this report is [here](https://github.com/emilyzhou112/MUSA5080-NJ-BikeShare).

# Set Up

```{r library and data, include=FALSE}
library(tidyverse)
library(sf)
library(lubridate)
library(tigris)
library(gganimate)
library(riem)
library(gridExtra)
library(knitr)
library(kableExtra)
library(tidycensus)
library(mapview)
library(sfdep)
library(spdep)
library(caret)
library (readr)

options(tigris_class = "sf")
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")

palette5 <- c("#124E78","#F0F0C9","#F2BB05","#D74E09","#6E0E0A")
palette4 <- c("#124E78","#F0F0C9","#F2BB05","#6E0E0A")
palette2 <- c("#124E78","#6E0E0A")

tidycensus::census_api_key("e79f3706b6d61249968c6ce88794f6f556e5bf3d", overwrite = TRUE)
```

The data we use here is obtained from [NYC OpenData](https://s3.amazonaws.com/tripdata/index.html) that contains the latest record of Citi Bike usage on a monthly base. Note that while Jersey City is located in New Jersey, Citi Bike is mainly operated in NYC and therefore its dataset is also operated by NYC. We selected all bikeshare data in Jersey City from Aug 1 2023 to Aug 31 2023 as our temporal frame. The dataset mainly includes locations and names of stations of which the user started the ride as well as end the ride. In addition, we also downloaded the census tract geometries of the entire Hudson County via tidycensus, which includes Jersey City. 

```{r obtain census geom and bike data, warning=FALSE, message=FALSE, results='hide', cache=TRUE}

nycbike <- read.csv(url("https://raw.githubusercontent.com/emilyzhou112/MUSA5080-NJ-BikeShare/main/JC-202308-citibike-tripdata.csv"))

njCensus <- get_acs(geography = "tract", 
          variables = c("B01003_001", "B19013_001", 
                        "B02001_002", "B08013_001",
                        "B08012_001", "B08301_001", 
                        "B08301_010", "B01002_001"), 
          year = 2021, 
          state = "NJ", 
          geometry = TRUE, 
          county=c("Hudson"),
          output = "wide") %>%
  rename(Total_Pop =  B01003_001E,
         Med_Inc = B19013_001E,
         Med_Age = B01002_001E,
         White_Pop = B02001_002E,
         Travel_Time = B08013_001E,
         Num_Commuters = B08012_001E,
         Means_of_Transport = B08301_001E,
         Total_Public_Trans = B08301_010E) %>%
  select(Total_Pop, Med_Inc, White_Pop, Travel_Time,
         Means_of_Transport, Total_Public_Trans,
         Med_Age,
         GEOID, geometry) %>%
  mutate(Percent_White = White_Pop / Total_Pop,
         Mean_Commute_Time = Travel_Time / Total_Public_Trans,
         Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport)

jcTracts <- njCensus %>%
  as.data.frame() %>%
  distinct(GEOID, .keep_all = TRUE) %>%
  select(GEOID, geometry) %>% 
  st_sf()
```

Since our raw bikeshare data is a non-spatial layer, despite containing spatial attributes, our next step is to make it spatial by joining it to the census geometry. We are only interested in the start location of each ride because we want to predict the capacity of each station when the user need a bike to start the ride. This requires us to filter out rows with missing latitude and longitude. We also found that some stations have erroneous longitude and latitude coding that lead to these points being located in the Hudson River. These points also need to be removed. 

```{r add census tract}

bike_census <- st_join(nycbike %>% 
                         filter(is.na(start_lat) == FALSE &
                                is.na(start_lng) == FALSE &
                                is.na(end_lat) == FALSE &
                                is.na(end_lng) == FALSE) %>%
                         st_as_sf(., coords = c("start_lng", "start_lat"), crs = 4326),
                       jcTracts %>%  st_transform(crs=4326),
                       join=st_intersects, left = TRUE) %>%
  rename(Origin.Tract = GEOID) %>%
  mutate(start_lng = unlist(map(geometry, 1)),
         start_lat = unlist(map(geometry, 2)))%>%
  filter(!(ride_id %in% c("BEC9CC4D1007C753", "86087F431785EDE7", "A1AAB92631439883", "71938DA085242DCC", "4203C77668C47C75", "C92410CA2AEF3947", "6E02CBF36C90F244", "0259885253FD238E", "B39136B37AFB5FE9", "1EC9376D1559C51C", "CDDF16D3A37BE370"))) %>% 
  as.data.frame()
  
```

The following visualization shows all the steps we did above. 

```{r distribution of stations}

ggplot()+
  geom_sf(data = jcTracts %>%
          st_transform(crs=4326), fill="white")+
  geom_point(data = bike_census, aes(x=start_lng, y = start_lat), 
            color = "#6E0E0A", alpha = 1, size = 0.3) + 
  ylim(min(bike_census$start_lat), max(bike_census$start_lat))+
  xlim(min(bike_census$start_lng), max(bike_census$start_lng)) +
  labs(title = "Location of Rides Starting Points in Jersey City") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )

```

We will perform some basic feature engineering using R's powerful `lubridate` package to extract some temporal information for each ride. This includes calculating the 60min and 15min interval for the starting time of each ride, and categorizing these intervals into by the time of day into "Overnight", "AM Rush", "Mid-Day", and "PM Rush" and by day of week into weekday and weekend. 

```{r temporal feature engineering}

bike_census <- bike_census %>%
  mutate(interval60 = floor_date(ymd_hms(started_at), unit = "hour"),
         interval15 = floor_date(ymd_hms(started_at), unit = "15 mins"),
         week = week(interval60),
         dotw = wday(interval60, label=TRUE)) %>% 
  mutate(time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>% 
  mutate(hour = hour(started_at),
                weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"))
```

Interested in seeing the commuting patterns to get a sense of rider's daily behavior, we created line strings that connect between the start and end locations of each ride. For the simplicity of visualization, we only visualize 200 rides. We may see that most of the rides occur within Jersey City, but there are also people who commute using Citi Bike from Jersey City to Manhattan.  

```{r origin-desintation matrix, warning=FALSE}

lines <- st_sfc(
  lapply(1:nrow(bike_census), function(i) {
    st_linestring(matrix(c(bike_census[i, "start_lng"], bike_census[i, "start_lat"],
                            bike_census[i, "end_lng"], bike_census[i, "end_lat"]), ncol = 2, byrow = TRUE))
  }))

lines_sf <- st_sf(geometry = lines)
lines_sf <- st_set_crs(lines_sf, 4326)

Route <- lines_sf %>% head(200)
mapview(Route, zcol = NULL, color = "#6E0E0A", linewidth = 0.001, alpha = 0.3) # only show 200
```


# Exploratory Analysis

In this section, we delve into some further exploratory analysis of our data set. 

## Temporal Analysis

Starting with temporal analysis, we first visualized the bikeshare trips per hour in Jersey City in August 2023. It is clear that there's some form of circularity in that there are both peak hours and non-peak hours during a single day. Also, we may see that five consecutive higher peaks are usually followed by two less higher peaks. These correspond to high bikeshare need during rush hours and low bikeshare needs overnight, in early mornings, and over the weekend

```{r bikeshare viz1}
ggplot(bike_census %>%
         group_by(interval60) %>%
         tally())+
  geom_line(aes(x = interval60, y = n), color= "#6E0E0A")+
  labs(title="Bike Share Trips Per Hour in Jersey City, Aug 2023",
       x="Date", 
       y="Number of trips") + 
theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=8),
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=10), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth =0.8))
  
```

Then, we visualized the distribution of trips by stations at each hour in August. We may see that the distribution of rides is highly right skewed, with most stations having zero or a few number of trips each hour and a couple of stations having much more rides, up to 40 or even 50 rides at a particular hour on a particular day in August 2023. 

```{r bikeshare viz3, warning=FALSE}

ggplot(bike_census %>%
         group_by(interval60, start_station_name) %>%
         tally())+
  geom_histogram(aes(n), binwidth = 5, fill="#6E0E0A")+
  labs(title="Bike Share Trips Per Hour by Station in Jersey City, Aug 2023",
       x="Trip Counts", 
       y="Number of Stations") + 
    theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=8),
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=10), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth =0.8))

```

We continue to visualize the distribution of mean number of trips at each station during different times of the day. We recognize that during PM rush, the average number of trips has been the highest throughout, with a couple of stations' average trip exceed 10, implying a relatively high demand for a single station. This is followed by ride demand in mid day, when some stations can have more than 5 rides. It is also surprising to notice that the demand is pretty high overnight, with a few stations having more than 5 rides. 

```{r bikeshare viz2, warning=FALSE, message=FALSE}

bike_census %>%
        group_by(interval60, start_station_name, time_of_day) %>%
         tally()%>%
  group_by(start_station_name, time_of_day)%>%
  summarize(mean_trips = mean(n))%>%
  ggplot()+
  geom_histogram(aes(mean_trips, fill=time_of_day), binwidth = 1)+
  scale_fill_manual(values = palette4, name="Time of Day") + 
  labs(title="Mean Number of Hourly Trips Per Station",
       x="Number of trips", 
       y="Frequency")+
  facet_wrap(~time_of_day) +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=8),
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=10), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8))

```

After that, we visualized the total number of trips in August separated by day of the week. The two main patterns here are that firstly, trip numbers vary by time of the day. Lowest trip counts occur in early morning while highest trip count occur in late afternoon. Secondly, Tuesday, Wednesday, and Thursday tend to observe higher number of trips. 

```{r bikeshare viz4}

ggplot(bike_census %>% mutate(hour = hour(started_at)))+
     geom_freqpoly(aes(hour, color = dotw), binwidth = 3, lwd = 0.8)+
  scale_color_manual(values = c("#124E78", "#819FA1", "#F0F0C9", "#F2BB05", "#E58507", "#D74E09", "#6E0E0A"), name = "Day") + 
  labs(title="Bike Share Trips in Jersey City by Day of the Week, Aug 2023",
       x="Hour", 
       y="Trip Counts") + 
  theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=8),
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=10), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8))

```

Same type of visualize but this time separated between weekend and weekdays. It is very obvious that we see more trips during weekdays and less trips during weekends. Also, during weekends, number of trips start to increase much later than during weekdays, suggesting the missing of morning rush hour will significantly impact people's use of bikeshare services. 

```{r bikeshare viz 5}

ggplot(bike_census)+
     geom_freqpoly(aes(hour, color = weekend), binwidth = 3, lwd = 1.2)+
  scale_color_manual(values=palette2) + 
  labs(color = "Time of Week") +
  labs(title="Bike Share Trips in Jersey City - Weekend vs Weekday, Aug 2023",
       x="Hour", 
       y="Trip Counts") + 
  theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=8),
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=10), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8))

```

Building onto that, we group this dataset by station and time of day to visualize the total number of trips at each station during a specific period of time on weekday or weekend. To simplify our visualization, we selected the top 50 occurrences in our dataset. Note that here, each bar represent an instance of station-time of day-weekday/weekend combination. We are able to see that PM rush hour indeed sees that greatest bikeshare demand for a couple of stations in particular. Weekend rides only appear three times in the top 50 list. Zooming into the detail reveals to us that **Grove St PATH** station observes a total of 1117 rides during PM rush on a weekday, 801 rides overnight on a weekday in August 2023. **Hoboken Terminal - River St & Hudson Pl** station observes a total of 864 rides during PM rush on weekday. **South Waterfront Walkway - Sinatra Dr & 1 St** station observes 319 rides overnight on a weekend and 731 rides during PM rush on a weekday in August 2023. 

```{r bikeshare viz 7, fig.width=10}

to_plot <- bike_census%>% # only show 100 stations
  group_by(start_station_id, start_lat, start_lng, weekend, time_of_day, start_station_name) %>%
  tally() %>% 
  arrange(-n) %>% 
  head(50) 
to_plot$ID <- seq_along(to_plot$start_station_id)

to_plot %>% 
  arrange(-n) %>%
  head(50) %>% 
  ggplot(aes(x = reorder(ID, -n), n, fill = time_of_day, color = weekend)) +
  scale_fill_manual(values = palette4, name="Time of Day") + 
  guides(color="none") + 
  geom_bar(stat = "identity", position="stack") +
  scale_color_manual(values = c("transparent", "black")) +
  labs(title="Top 50 Occurences of Rides By Time")+
  ylab("Number of Rides") +
  xlab("") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x = element_blank(),
    axis.ticks.x = element_blank(), 
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=10), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8))
```


## Spatial Analysis

In addition to examining our data through temporal lens, we made some maps to show the actual location of stations with the greatest bikeshare demand by time of day and day of the week. This proportional map illustrates that stations receive the highest ride demand during PM rush and overnight in weekdays and the lowest demand during AM rush. Stations with the highest bikeshare demand are generally located along the Hudson River. This could be attributed to two reasons: there are simply more stations in this part of Jersey City or Citi Bike has strategically built bike stations along the Hudson River to make daily commute easier for those who live in Jersey City but need to work/study in Manhattan. 

```{r bikeshare viz 6}

ggplot()+
  geom_sf(data = jcTracts %>%
          st_transform(crs=4326), fill = "white")+
  geom_point(data = to_plot,
             aes(x=start_lng, y = start_lat, size=n), color = alpha("#6E0E0A", 0.7)) + 
  ylim(min(bike_census$start_lat), max(bike_census$start_lat))+
  xlim(min(bike_census$start_lng), max(bike_census$start_lng)) +
  scale_size(range = c(0.5, 8)) + 
  labs(size = "Number of Rides") +
  facet_grid(weekend ~ time_of_day) + 
    labs(title="Locations of Stations with Greatest Bikeshare Demand by Time")+ 
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )
```

We grouped all the starting stations together to check for spatial autocorrelation between stations. We believe that stations with greater demand are more likely to be located to other stations with greater demand. In this case, we can use the demand of nearby stations at a particular time to estimate the demand of our target station. We selected 80 stations with the highest number of rides in August 2023 and computed the Moran's I with 999 random Monte Carlo simulation. The results show that our observed Moran's I is highly significant, implying that there's spatial autocorrelation between bikeshares at nearby stations.  

```{r spatial auto cor, warning=FALSE}

moran_data <- bike_census %>%
  group_by(start_station_id, start_lat, start_lng, start_station_name) %>%
  tally() %>% 
  arrange(-n) %>% 
  head(80) %>% 
  st_as_sf(., coords = c("start_lng", "start_lat"), crs = 4326)


coords <-  st_coordinates(moran_data) 
neighborList <- knn2nb(knearneigh(coords, 5))
spatialWeights <- nb2listw(neighborList, style="W")

moranTest <- moran.mc(moran_data$n, 
                      spatialWeights, nsim = 999)

ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
  geom_histogram(binwidth = 0.01, fill = "#124E78") +
  geom_vline(aes(xintercept = moranTest$statistic), colour = "#6E0E0A" ,lwd=1) +
  scale_x_continuous(limits = c(-0.5, 0.5)) +
  labs(title = "Observed and Permuted Moran's I for Stations with Most Rides",
       subtitle = "Observed Moran's I in Red",
       x = "Moran's I",
       y = "Count") +
  theme_light() +   
theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=8),
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=10))
```

This map below shows how stations with higher bikeshare demand are clustered. In particular, there are two clusters of stations, and within each cluster, there are a few stations with particularly large number of rides. These stations are generally surrounded by stations with slightly less rides. 

```{r spatial cor viz, fig.height=8, fig.width=11}

  ggplot() +
  geom_sf(data = jcTracts, fill = "white") +
  geom_sf(data = moran_data,
          aes(size = n, color = n)) + 
  ylim(min(bike_census$start_lat), max(bike_census$start_lat))+
  xlim(min(bike_census$start_lng), max(bike_census$start_lng)) + 
  scale_size(
    range = c(1, 6),
    guide = guide_legend(
      direction = "horizontal",
      nrow = 1,
      label.position = "bottom")) +
  scale_color_gradientn(colours = hcl.colors(5, "RdBu",rev = TRUE, alpha = 0.9), 
                        guide = guide_legend(direction = "horizontal", nrow = 1)) +
  labs(title = "Stations with Most Rides",
       subtitle =  "Dot size proportional to rides") + 
  theme(legend.position = "bottom") + 
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )

```

This animated map below combines the spatial and temporal characteristics of bike rides at 15min interval into a choropleth map. 

```{r animate map temporal viz, message=FALSE, warning=FALSE, cache=TRUE, fig.show='animate'}

week33 <-
  filter(bike_census , week == 33 & dotw == "Mon")

week33.panel <-
  expand.grid(
    interval15 = unique(week33$interval15),
    Origin.Tract = unique(bike_census$Origin.Tract))

bike.animation.data <-
  mutate(week33, Trip_Counter = 1) %>%
    right_join(week33.panel) %>% 
    group_by(interval15, Origin.Tract) %>%
    summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>% 
    ungroup() %>% 
    left_join(jcTracts, by=c("Origin.Tract" = "GEOID")) %>%
    st_sf() %>%
    mutate(Trips = case_when(Trip_Count == 0 ~ "0 trips",
                             Trip_Count > 0 & Trip_Count <= 3 ~ "1-3 trips",
                             Trip_Count > 3 & Trip_Count <= 6 ~ "4-6 trips",
                             Trip_Count > 6 & Trip_Count <= 10 ~ "7-10 trips",
                             Trip_Count > 10 ~ "11+ trips")) %>%
    mutate(Trips  = fct_relevel(Trips, "0 trips","1-3 trips","4-6 trips",
                                       "7-10 trips","10+ trips"))

bikeshare_animation <-
  ggplot() +
    geom_sf(data = bike.animation.data, aes(fill = Trips)) +
    scale_fill_manual(values = palette5) +
    labs(title = "Bikeshare For One Day in Aug 2023",
         subtitle = "15 minute intervals: {current_frame}") +
    transition_manual(interval15)  + 
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )
  


anim_save("animation.gif", animation = bikeshare_animation, end_pause = 4, height = 1000, width = 700, fps = 6, renderer = gifski_renderer())



```

![Animated Map for Bikeshare Demand in Week 33](animation.gif)


## Weather Analysis

On top of space and time, weather can be another significant factor of in predicting bikeshare demand. Generally, we believe bikeshare demand will decrease if it's on a rainy, windy, or a very hot day in our case.  With that in mind, we downloaded the weather data using the `riem_measures` function for Newark airport between Aug 1, 2023 and Sept 1, 2023. Note that the Newark Airport station sufficiently provides temporal weather for all of Jersey City. The weather.Panel is generated to summarize temperature, precipitation, and wind speed for every hour. 

```{r download and wrangle weather data, warning=FALSE}

weather.Panel <- 
  riem_measures(station = "EWR", date_start = "2023-08-01", date_end = "2023-09-01") %>%
  dplyr::select(valid, tmpf, p01i, sknt)%>%
  replace(is.na(.), 0) %>%
    mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
    mutate(week = week(interval60),
           dotw = wday(interval60, label=TRUE)) %>%
    group_by(interval60) %>%
    summarize(Temperature = max(tmpf),
              Precipitation = sum(p01i),
              Wind_Speed = max(sknt)) %>%
    mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))

```

For the temperature, precipitation, and wind data we collected during this month. We may see that there are a couple of rainy days in August. Wind speed is pretty consistent throughout with a couple of really windy days. Temperature fluctuates on a daily basis with higher temperature during the day and cooler temperature at night. 

```{r visualize weather data, fig.height=11, fig.width=8, warning=FALSE}

grid.arrange(
  ggplot(weather.Panel, aes(interval60,Precipitation)) + geom_line(color="#6E0E0A") + 
    labs(title="Percipitation", x="Hour", y="Perecipitation") + 
    theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=8),
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=10), 
        panel.background = element_blank()), 
  ggplot(weather.Panel, aes(interval60,Wind_Speed)) + geom_line(color="#D74E09") + 
    labs(title="Wind Speed", x="Hour", y="Wind Speed") +
    theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=8),
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=10), 
        panel.background = element_blank()),  
  ggplot(weather.Panel, aes(interval60,Temperature)) + geom_line(color="#124E78") + 
    labs(title="Temperature", x="Hour", y="Temperature") +theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=8),
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=10), 
        panel.background = element_blank()), 
  top="Weather Data ")

```


# Space-Time Panel

In the following steps, we created a study panel where each instance in the panel is a unique combination of space and time. In other words, each row will represent the ride at a particular station during a particular hour. 

```{r create study panel, warning=FALSE, message=FALSE}

study.panel <- 
  expand.grid(interval60=unique(bike_census$interval60), 
              start_station_id = unique(bike_census$start_station_id)) %>%
  left_join(., bike_census %>%
              select(start_station_id, start_station_name, Origin.Tract, start_lng, start_lat)%>%
              distinct() %>%
              group_by(start_station_id) %>%
              slice(1))
```

We need to add some more information to this panel. This includes counting the number of rides at this station at this particular hour, adding weather information, bringing in census data, and calculating time and day of week using the `lubridate` package.  

```{r add info to study panel, warning=FALSE, message=FALSE}

bike.panel <- 
  bike_census %>%
  mutate(Trip_Counter = 1) %>%
  right_join(study.panel) %>% 
  group_by(interval60, start_station_id, start_station_name, Origin.Tract, start_lng, start_lat) %>%
  summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
  left_join(weather.Panel) %>%
  ungroup() %>%
  filter(is.na(start_station_id) == FALSE) %>%
  mutate(week = week(interval60),
         dotw = wday(interval60, label = TRUE)) %>%
  filter(is.na(Origin.Tract) == FALSE) %>% 
  left_join(., njCensus %>%
              as.data.frame() %>%
              select(-geometry), by = c("Origin.Tract" = "GEOID"))

```

## Weather Correlation

Now, we could perform more exploratory analysis on our panel data to prepare for our regression model. First, we would like to see how temperature will correlate with the number of bike rides in August. The figure below shows a significant, positive, strong, and linear relationship between temperature and mean trip count for all weeks in August. This means that as the temperature increase, bikeshare demand will increase as well. 

```{r visualize temp, warning=FALSE, message=FALSE}

bike.panel %>%
  group_by(interval60) %>% 
  summarize(Trip_Count = mean(Trip_Count),
            Temperature = first(Temperature)) %>%
  mutate(week = week(interval60)) %>%
  ggplot(aes(Temperature, Trip_Count)) + 
    geom_point(color = "#6E0E0A") + geom_smooth(method = "lm", se= FALSE, color="#124E78") +
    facet_wrap(~week, ncol=8) + 
    labs(title="Trip Count As a Fuction of Temperature by Week",
         x="Temperature", y="Mean Trip Count") + 
  theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x = element_blank(),
    axis.ticks.x = element_blank(), 
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=10), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8))
  

```

Next, we would like to see how wind speed will correlate with the number of bike rides in August. We see that a linear and positive relationship between these two variables, but the strength of such relationship varies between different weeks. This generally means that we see more bikeshare demand during windy days. 

```{r visualize wind, warning=FALSE, message=FALSE}
bike.panel %>%
  group_by(interval60) %>% 
  summarize(Trip_Count = mean(Trip_Count),
            Wind = first(Wind_Speed)) %>%
  mutate(week = week(interval60)) %>%
  ggplot(aes(Wind, Trip_Count)) + 
    geom_point(color = "#6E0E0A") + geom_smooth(method = "lm", se= FALSE, color="#124E78") +
    facet_wrap(~week, ncol=8) + 
    labs(title="Trip Count As a Fuction of Wind by Week",
         x="Wind Speed", y="Mean Trip Count") + 
  theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x = element_blank(),
    axis.ticks.x = element_blank(), 
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=10), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)) + 
    theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x = element_blank(),
    axis.ticks.x = element_blank(), 
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=10), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8))
```


## Serial Correlation

Should the number of trips exhibit serial (temporal) correlation, then the time lag features will lead to better predictions. The logic behind is that the number of trips at a particular hour will very much correlate with the number of trips immediately before or after that hour. If we know the number of trips at a particular hour, then it would be easier for us to estimate the number of rides around that time. To achieve that, we calculated 6 different forms of lag hours. 

```{r calculate serial lag}

bike.panel <- 
  bike.panel %>% 
  arrange(start_station_id, interval60) %>% 
  mutate(lagHour = dplyr::lag(Trip_Count,1),
         lag2Hours = dplyr::lag(Trip_Count,2),
         lag3Hours = dplyr::lag(Trip_Count,3),
         lag4Hours = dplyr::lag(Trip_Count,4),
         lag12Hours = dplyr::lag(Trip_Count,12),
         lag1day = dplyr::lag(Trip_Count,24)) %>%
   mutate(day = yday(interval60))

```

The relationship between these lag hours and number of trips are visualized. We see a strong, linear, and negative relationship between lag12hours and number of trips 12 hours before. This means for example, the number of rides we see at 5pm on a particular day is negatively correlated with the number of rides at 5am that day. In contrast, we see a strong, linear, and positive relationship between laghour and lag1day. This means that the number of rides at a particular hour is highly correlated with the number of rides in the immediate next hour and at the same time the following day. Other than that, the strength of the correlation decreases for lag2hour and lag3hour, meaning that the number of rides at 4pm on a particular day, is more correlated with the number of rides at 5pm, rather than at 6 or 7 pm. 

```{r visualize serial correlations, message=FALSE, warning=FALSE}

as.data.frame(bike.panel) %>%
    group_by(interval60) %>% 
    summarise_at(vars(starts_with("lag"), "Trip_Count"), mean, na.rm = TRUE) %>%
    gather(Variable, Value, -interval60, -Trip_Count) %>%
  ggplot(aes(x = Value , y = Trip_Count)) +
    geom_point(size = 1, color = "#6E0E0A") + geom_smooth(method = "lm", se= FALSE, color="#124E78") +
  facet_wrap(~Variable, ncol = 6)+
  labs(x = "Variable", y = "Correlation", title = "Correlation Between Serial Lag Trips and Trip Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x = element_blank(),
    axis.ticks.x = element_blank(), 
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=10), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8))

```

## Spatial Correlation

We've seen before that spatial autocorrelation exists here in that the number of trips at a particular station during a particular time is correlated with the number of trips at nearby stations. This means that if we know the number of trips at a station, we may be able to estimate the number of trips at the station next to it. Do achieve that, we created a new dataframe called `lag_result` and wrote a nested loop to loop through each hour in August 2023. For each hour, we then calculate the nearest 5 neighbor of each station and the average of rides for that 5 stations. The results were joined back to our panel dataset. 

```{r calculate spatial lag, message=FALSE, warning=FALSE}

lag_result <- data.frame()

for (d in 213:243) {
  for (h in 0:23) {

spacelag <- bike.panel %>% 
  mutate(hour = hour(interval60)) %>% 
  filter(day == d & hour == h) %>% 
  st_as_sf(., coords = c("start_lng", "start_lat"), crs = 4326) 


coords.test_lag <-  st_coordinates(spacelag) 
neighborList.test_lag <- knn2nb(knearneigh(coords.test_lag, 5))
spatialWeights.test_lag <- nb2listw(neighborList.test_lag, style="W")

spacelag <- spacelag %>% mutate(lagTrip = lag.listw(spatialWeights.test_lag, Trip_Count))

lag_result <- rbind(lag_result, spacelag)
  }
  
}

bike.panel <- lag_result %>% 
  mutate(start_lng = unlist(map(geometry, 1)),
         start_lat = unlist(map(geometry, 2))) %>% 
 st_drop_geometry() 
  
  
```

The figure below shows that for all weeks in August, there's a strong, linear, and positive relationship between the number of trips and the number of trips in nearby stations. We will therefore include this variable in our following regression analysis. 

```{r visualize spatial correlation, message=FALSE, warning=FALSE}

bike.panel %>%
  group_by(interval60) %>% 
  summarize(Trip_Count = mean(Trip_Count),
            Lag_Count = mean(lagTrip)) %>%
  mutate(week = week(interval60)) %>%
  ggplot(aes(Lag_Count, Trip_Count)) + 
    geom_point(size = 1, color = "#6E0E0A") + geom_smooth(method = "lm", se= FALSE, color="#124E78") +
    facet_wrap(~week, ncol=8) + 
    labs(title="Correlation Between Spatial Lag Trips and Trip Count",
         x="Lag Trip Count", y="Mean Trip Count") + 
  theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x = element_blank(),
    axis.ticks.x = element_blank(), 
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=10), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8))

```


# OLS Regression

We split our 5 week data into training set, which includes 3 weeks and testing set, which include 2 weeks. After that, four different linear regression are estimaed on training data, each with different fixed effects. 

- `reg1` focuses on just time, including hour fixed effects, day of the week, Wind Speed and Temperature.
- `reg2` focuses on just space, including station and weather conditions. 
- `reg3` focuses on the combined effect of space and time
- `reg4` focuses adds the time and space lag features


```{r fitting regression}

bike.Train <- filter(bike.panel, week <= 33)
bike.Test <- filter(bike.panel, week > 33)

# Time
reg1 <- 
  lm(Trip_Count ~  hour(interval60) + dotw + Temperature + Wind_Speed, data=bike.Train)

# Space
reg2 <- 
  lm(Trip_Count ~  start_station_name + Temperature + Wind_Speed,  data=bike.Train)

# Time and Space
reg3 <- 
  lm(Trip_Count ~  start_station_name + hour(interval60) + dotw + Temperature + Wind_Speed, 
     data=bike.Train)

# Time and Space and TimeLag and SpaceLag
reg4 <- 
  lm(Trip_Count ~  start_station_name +  hour(interval60) + dotw + Temperature + Wind_Speed +
                   lagHour + lag2Hours + lag12Hours + lag1day + lagTrip, 
     data=bike.Train)
```

## Making Predictions

The purrr family of functions will be used to loop through a set of 'nested' data frames to efficiently compare across different model specifications. Before testing our data, we nested our data by week and created a model function. The first set of code separate our testing data into two tibbles. The small function is created that takes a tibble, dat and a regression model, fit as its inputs, and outputs predictions as pred. This function is used simply for prediction. 

```{r nest data and predict function, warning=FALSE}

bike.Test.weekNest <- 
  bike.Test %>%
  nest(-week) 

model_pred <- function(dat, fit){
   pred <- predict(fit, newdata = dat)}

```


The nested format allows one to loop through each model for each week and mutate summary statistics. In the code block below, week_predictions are calculated for each week in the testing set. The map function applies the `model_pred` function to each nested tibble. Take the first line of the below mutate, for example. A new column, Time_FE, includes predictions for reg1 - the time fixed effects model. The predictions are created by mapping the function, model_pred, to each row of data, parameterizing fit as the reg1 model.

```{r making predictions, warning=FALSE}

week_predictions <- 
  bike.Test.weekNest %>% 
    mutate(ATime_FE = map(.x = data, fit = reg1, .f = model_pred),
           BSpace_FE = map(.x = data, fit = reg2, .f = model_pred),
           CTime_Space_FE = map(.x = data, fit = reg3, .f = model_pred),
           DTime_Space_FE_Lags = map(.x = data, fit = reg4, .f = model_pred)) %>% 
    gather(Regression, Prediction, -data, -week) %>%
    mutate(Observed = map(data, pull, Trip_Count),
           Absolute_Error = map2(Observed, Prediction,  ~ abs(.x - .y)),
           MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
           sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))

```

## Error Analysis

Comparing the mean absolute errors for our four models, we find the error to be the higest for model 1, which only takes time effect into account. Model 2 and 3 account for space and space + time respectively, which lead to an increase in accuracy level. Adding space lag and time lag, however, will greatly improve the accuracy of our mode. We also ran the variance of inflation function in the background and checked to make sure that there's no colinerarity in model 4. 

```{r plot errors}

week_predictions %>%
  dplyr::select(week, Regression, MAE) %>%
  gather(Variable, MAE, -Regression, -week) %>%
  ggplot(aes(week, MAE)) + 
    geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
    scale_fill_manual(values = palette4) +
    labs(title = "Mean Absolute Errors by Model Specification and Week") +
  theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x = element_blank(),
    axis.ticks.x = element_blank(), 
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=10), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8))

```
The figure below further demonstrates the accuracy of each model by comparing predicted to observed number of trips. Model 4 clearly stands out among the 4 models to have higher predicting power of bikeshare demand. 

```{r comparing predicted and observed, fig.height=8, fig.width=11, message=FALSE, warning=FALSE, cache=TRUE}

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station_id = map(data, pull, start_station_id)) %>%
    dplyr::select(interval60, start_station_id, Observed, Prediction, Regression) %>%
    unnest() %>%
    gather(Variable, Value, -Regression, -interval60, -start_station_id) %>%
    group_by(Regression, Variable, interval60) %>%
    summarize(Value = sum(Value)) %>%
    ggplot(aes(interval60, Value, colour=Variable)) + 
      geom_line(size = 1.1) + 
      scale_color_manual(values=palette2)+
      facet_wrap(~Regression, ncol=1) +
      labs(title = "Predicted/Observed Bike Share Time Series", subtitle = "Jersey City: a test set of 2 weeks",  x = "Hour", y= "Station Trips") + 
   theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x = element_blank(),
    axis.ticks.x = element_blank(), 
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=10), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8))

```

We also need to check if our predictions generalize across space and time. To do that, we mapped our mean absolute error of model 4 across space. We can see that this model does not do a better job in predicting number of rides at stations along the Hudson River, which are typically stations with higher bikeshare demands during both weekend and weekdays. Other than that, this model does a pretty good job in predicting bikeshare demands at stations with lower number of trips in August. 

```{r error by station, warning=FALSE, message=FALSE, cache=TRUE}

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station_id = map(data, pull, start_station_id), 
           start_latitude = map(data, pull, start_lat), 
           start_longitude = map(data, pull, start_lng)) %>%
    select(interval60, start_station_id, start_longitude, start_latitude, Observed, Prediction, Regression) %>%
    unnest() %>%
  filter(Regression == "DTime_Space_FE_Lags") %>%
  group_by(start_station_id, start_longitude, start_latitude) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
  geom_sf(data = jcTracts,  fill = "white")+
  geom_point(aes(x = start_longitude, y = start_latitude, color = MAE), 
             fill = "transparent")+
  scale_color_continuous(low = "#124E78", high = "#6E0E0A", name= "MAE")+
 ylim(min(bike_census$start_lat), max(bike_census$start_lat))+
  xlim(min(bike_census$start_lng), max(bike_census$start_lng)) +
  labs(title="Mean Abs Error, Test Set, Model 4") + 
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )

```
Comparing across space and time, the figure below shows that model 4 has been pretty consistent in predicting number of trips across time. Stations that are predicted to have higher demand in the weekend will continue to be predicted to have higher demand during the weekdays. Errors carry throughout time. This implies that stations that historically have higher demand have higher variability in the number of rides and are more difficult to predict with the current predictors we have. In fact, our current predictors only explains 50.12% of the variability. What's problematic about this is that if we continue to have larger errors when predicting stations with higher bikeshare demand, we will largely fail to meet the demand of bike users at these stations. 

```{r comparing errors across space and time, warning=FALSE, message=FALSE, cache=TRUE}

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station_id = map(data, pull, start_station_id), 
           start_latitude = map(data, pull, start_lat), 
           start_longitude = map(data, pull, start_lng),
           dotw = map(data, pull, dotw) ) %>%
    select(interval60, start_station_id, start_longitude, 
           start_latitude, Observed, Prediction, Regression,
           dotw) %>%
    unnest() %>%
  filter(Regression == "DTime_Space_FE_Lags")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  group_by(start_station_id, weekend, time_of_day, start_longitude, start_latitude) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  ggplot(.)+
  geom_sf(data = jcTracts, color = "grey", fill = "transparent")+
  geom_point(aes(x = start_longitude, y = start_latitude, color = MAE), 
             fill = "transparent", size = 0.5)+
  scale_color_continuous(low = "#124E78", high = "#6E0E0A", name= "MAE")+
 ylim(min(bike_census$start_lat), max(bike_census$start_lat))+
  xlim(min(bike_census$start_lng), max(bike_census$start_lng)) +
  facet_grid(weekend~time_of_day)+
  labs(title="Mean Absolute Errors by Time, Test Set, Model 4") + 
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )
```


## Cross-Validation

Finally, we performed 100 cross-validation tests for all five weeks of data using model 4. The MAE value is 0.44, indicating that despite being our best solution so far, the model's prediction error can still be relatively large. It will also be good practice to cross validate across socio-economic indicators to see if bike stations' demand within certain neighborhoods are disproportionately under/over predicted, but our model current does not take into account any socioeconomic variables as predictor variable and the scenario in Jersey City is that more stations are built 

```{r cross validation}

fitControl <- trainControl(method = "cv", number = 100)

reg.cv <- train(Trip_Count ~ start_station_name +  hour(interval60) + dotw + Temperature + Wind_Speed + lagHour + lag2Hours + lag12Hours + lag1day + lagTrip, data=bike.panel, method = "lm", trControl = fitControl, na.action = na.pass)

reg.cv$resample %>% 
  summarise(MAE = mean(reg.cv$resample[,3]),
            sd(reg.cv$resample[,3])
) %>%
  kbl(col.name=c('Mean Absolute Error','Standard Deviation of MAE')) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
```


# Conclusion

In conclusion, this report addresses the critical need for effective bikeshare management strategies in Jersey City by employing machine learning techniques. Our analysis shows that spatial, temporal, and weather-related factors all collectively contribute to the variations in bikeshare demands. We found that it is more accurate to predict bikeshare demand at a particular hour based on the number of trips at the hour immediately before and after, during the same hour a day before, as well as the number of trips at nearby stations. We also found that bikeshare demand are the highest during windy and hot days in Jersey City in August.  The PM rush and overnight during weekdays see the highest demand while the PM rush during weekend sees the highest demand. On top of that, residents of Jersey City uses Citi Bike heavily for work commute to New York City, which has made stations along the Hudson River, particular those closer to ferry stations, the most popular. 

We did some further investigations into stations with greatest bikeshare demand. **Grove St PATH** station, which observes a total of 1117 rides during PM rush on weekday and 801 rides overnight in August, is located in the busy downtown Jersey City.**Hoboken Terminal - River St & Hudson Pl**, which observes a total of 864 rides during PM rush on weekday, is located right next to Hoboken/NJ Transit Terminal and along the Hudson River. **South Waterfront Walkway - Sinatra Dr & 1 St** station, which observes 319 rides overnight on a weekend and 731 rides during PM rush on a weekday is within walking distance to the NJ Transit Terminal and at the entrance to Pier A Park, a waterfront recreation area in Jersey City. Therefore, we may conclude that stations receive higher demand because 1) there's more workers/visitors in downtown area who need to bike to get around. 2) people need to commute to New York for work and need to bike to ferry/transit stations, 3) waterfront recreation areas is great for biking activities over the weekend. Local government should indeed pay more attention to the "supply" and "demand of bikes at those kind of stations and rebalance bikes if needed be. 

Our model is able to capture more than half of the bikeshare demand in Jersey City, though still, it requires improvement to capture more variability. Otherwise, it will fail to meet the needs of Jersey City residents and  increase the difficulties of rebalancing bikes. Nevertheless, the transparency and simplicity of this open-source approach provide a valuable resource for bikeshare programs to enhance user experience and maintain a harmonious bike distribution across stations. As urban mobility continues to evolve, the application of such predictive models becomes increasingly essential for sustainable and efficient transportation solutions.

