Predict bike share trips in Philadelphia

1. Introduction

Bike share is that public bicycles, usually next to subways, bus stations, parks, residential and shopping area, are provided for individuals to satisfy their short-term travelling demand by paying rent according to their travelling time. Nowadays it has been widely developed in many metropolitan regions, such as Philadelphia and NYC, to bring travelers with convenience and health benefits, and contribute to the environment their live.
Despite such benefits, during peak travelling time when people commute to school or work, commuters usually can not find a sharing bike, or do not know where to return the bike since bike stations can be either full vacant during this time, which will cause delay and inconvenience for their travelling. The spatial and time distribution of travellers’ demand is extremely uneven. To prevent such problems, organizers of bike share system should allocate the bike sharing rationally and evenly. This is called rebalancing, a key issue in bike-sharing system (Tian et al. 2020).
For organizers, the demand of travellers should be matched with the supply of bike share timely as possible. This requires in-depth studies of travellers’ characteristics according to their spatial and temporal distribution with bike sharing. Under this context, this study forecasts the ride share demand in Philadelphia based on the start and end station of each bike-sharing traveller. The prediction results will help organizers make better decisions for the management of bike share system in Philadelphia.

2. Data Wrangling

2.1 Bike share

This study choose the Indego bike share data from October 8th to November 11st, a 5-week data of the year 2019 in Philadelphia. We want to ignore the effects of holidays that may disrupt the expected demand during a given weekend or weekday, so there is no national holiday in the study period we chose. Trips generated from starting stations are only considered in this study. The travel interval of 60 minute and 15 minute is considered in our study.

ride.PA<-read.csv("C:/Users/zheng/Desktop/508 ridership/indego-trips-2019-q4.csv/indego-trips-2019-q4.csv")
ride2 <-
  ride.PA %>% 
  mutate(interval60 = floor_date(ymd_hms(start_time), unit = "hour"),
         interval15 = floor_date(ymd_hms(start_time), unit = "15 mins"),
         week = week(interval60),
         dotw = wday(interval60, label=TRUE))%>%
  filter(week>=41 & week <= 45)

2.2 Weather Data

We collect weather data, including temperature, wind speed, precipitation on hourly basis in our study.

weather.PA <- 
  riem_measures(station = "PHL", date_start = "2019-10-08", date_end = "2019-11-12")

weather.Panel <-  
  weather.PA %>%
  mutate_if(is.character, list(~replace(as.character(.), is.na(.), "0"))) %>% 
  replace(is.na(.), 0) %>%
  mutate(interval60 = ymd_h(substr(valid, 1, 13))) %>%
  mutate(week = week(interval60),
         dotw = wday(interval60, label=TRUE)) %>%
  group_by(interval60) %>%
  summarize(Temperature = max(tmpf),
            Percipitation = sum(p01i),
            Wind_Speed = max(sknt)) %>%
  mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))

2.3 Amenity features

The distance to the public services, such as bus stops, picnic sites, schools and playgrounds, affect people’s travelling mode. We assume that bike sharing is frequent in recreational sites. Work or study places as well as public transportation is the destination for many travellers. Under this premise, some amenities features are gathered and the average nearest neighbor distance to them from each station is calculated. The distance to the nearest three amenities is chosen in our study.

#bus stops
bus_stops <- st_read("https://opendata.arcgis.com/datasets/e09e9f98bdf04eada214d2217f3adbf1_0.geojson")%>%
  dplyr::select(Y = Longitude, X = Latitude) %>%
  na.omit() %>%
  st_as_sf(coords = c("X", "Y"), crs = 4236, agr = "constant")

#picnic sites
picnic<-st_read("https://phl.carto.com/api/v2/sql?q=SELECT+*+FROM+ppr_picnic_sites&filename=ppr_picnic_sites&format=geojson&skipfields=cartodb_id")%>%
  dplyr::select(geometry) %>%
  na.omit() 

#schools
school<-st_read("https://opendata.arcgis.com/datasets/d46a7e59e2c246c891fbee778759717e_0.geojson")%>%
  dplyr::select(geometry) %>%
  na.omit() 

#playground
playground<-st_read("https://opendata.arcgis.com/datasets/899c807e205244278b3f39421be8489c_0.geojson")%>%
  dplyr::select(geometry) %>%
  na.omit() 

phl.ride <- ride2 %>%
  na.omit() %>%
  st_as_sf(coords = c("start_lon", "start_lat"), crs = 4326, agr = "constant")

st_c <- st_coordinates

nn_function <- function(measureFrom,measureTo,k) {
  measureFrom_Matrix <-
    as.matrix(measureFrom)
  measureTo_Matrix <-
    as.matrix(measureTo)
  nn <-   
    get.knnx(measureTo, measureFrom, k)$nn.dist
  output <-
    as.data.frame(nn) %>%
    rownames_to_column(var = "thisPoint") %>%
    gather(points, point_distance, V1:ncol(.)) %>%
    arrange(as.numeric(thisPoint)) %>%
    group_by(thisPoint) %>%
    summarize(pointDistance = mean(point_distance)) %>%
    arrange(as.numeric(thisPoint)) %>% 
    dplyr::select(-thisPoint) %>%
    pull()
  
  return(output)  
}

phl.ride <-
  phl.ride %>%
  mutate(
    bus_stops.nn =
      nn_function(st_c(phl.ride), st_c(bus_stops),3),
    school.nn =
      nn_function(st_c(phl.ride), st_c(school),3),
    picnic.nn =
      nn_function(st_c(phl.ride), st_c(picnic),3),
    playground.nn =
      nn_function(st_c(phl.ride), st_c(playground),3))

phl.ride <-
  phl.ride%>%
  select(start_station, end_station, bus_stops.nn, school.nn, picnic.nn, playground.nn,
         interval60, interval15, week, dotw)

2.4 Census information

To test the generalizability later, the 2015-2019 5-year ACS census data is collected. We assume that median income, race, age, mean commute time and prederence of taking public transportation may also affect travellers’ willingness on bike sharing. So some of varaibles from census data are also predictors in the prediction model.

phl.census<- get_acs(geography = "tract", 
                     variables = c("B01003_001", "B19013_001", 
                                   "B02001_002", "B08013_001",
                                   "B08012_001", "B08301_001", 
                                   "B08301_010", "B01002_001"
                              ), 
                     year = 2019, 
                     state = "PA", 
                     geometry = TRUE, 
                     county=c("Philadelphia"),
                     output = "wide")%>%
  rename(Total_Pop =  B01003_001E,
         Med_Inc = B19013_001E,
         Med_Age = B01002_001E,
         White_Pop = B02001_002E,
         Travel_Time = B08013_001E,
         Num_Commuters = B08012_001E,
         Means_of_Transport = B08301_001E,
         Total_Public_Trans = B08301_010E) %>%
  dplyr::select(Total_Pop, Med_Inc, White_Pop, Travel_Time,
                Means_of_Transport, Total_Public_Trans,
                Med_Age,
                GEOID, geometry)%>%
  mutate(Percent_White = White_Pop / Total_Pop,
         Mean_Commute_Time = Travel_Time / Total_Public_Trans,
         Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport)

2.5 Create the final space/time panel

The dataset for this analysis must be a complete panel with an observation for every possible space/time combination. So a panel (space/time series) dataset is created first, which includes all the unique space/time interval. There are 118440 unique space/time units in our study.panel.

#join ride share data with census tract
dat_census <- st_join(ride2 %>% 
                        filter(is.na(start_lon) == FALSE &
                                 is.na(start_lat) == FALSE &
                                 is.na(end_lon) == FALSE &
                                 is.na(end_lat) == FALSE) %>%
                        st_as_sf(., coords = c("start_lon", "start_lat"), crs = 4326),
                      phl.census %>%
                        st_transform(crs=4326),
                      join=st_intersects,
                      left = TRUE) %>%
  rename(Origin.Tract = GEOID) %>%
  mutate(start_lon = unlist(map(geometry, 1)),
         start_lat = unlist(map(geometry, 2)))%>%
  as.data.frame() %>%
  select(-geometry)%>%
  st_as_sf(., coords = c("end_lon", "end_lat"), crs = 4326) %>%
  st_join(., phl.census %>%
            st_transform(crs=4326),
          join=st_intersects,
          left = TRUE) %>%
  rename(Destination.Tract = GEOID)  %>%
  mutate(end_lon  = unlist(map(geometry, 1)),
         end_lat = unlist(map(geometry, 2)))%>%
  as.data.frame() %>%
  select(-geometry)

length(unique(dat_census$interval60)) * length(unique(dat_census$start_station))

study.panel <- 
  expand.grid(interval60=unique(dat_census$interval60), 
              start_station = unique(dat_census$start_station)) %>%
  left_join(., dat_census %>%
              select(start_station, Origin.Tract, start_lon, start_lat)%>%
              distinct() %>%
              group_by(start_station) %>%
              slice(1))

nrow(study.panel)

By merging space/time intervals, the final ride.panel is created from actual trips in ride.template with dat_census that saw no trips in study.panel. The census data as well as amenities features are joined with this panel.

ride.panel <- 
  dat_census %>%
  mutate(Trip_Counter = 1) %>%
  right_join(study.panel) %>% 
  group_by(interval60, start_station, Origin.Tract, start_lon, start_lat) %>%
  summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
  left_join(weather.Panel) %>%
  #left_join(phl.ride)%>%
  ungroup() %>%
  filter(is.na(start_station) == FALSE) %>%
  mutate(week = week(interval60),
         dotw = wday(interval60, label = TRUE)) %>%
  filter(is.na(Origin.Tract) == FALSE)

#join the census data to the ride.panel
ride.panel <- 
  left_join(ride.panel, phl.census %>%
              as.data.frame() %>%
              select(-geometry), by = c("Origin.Tract" = "GEOID"))

#join the amenities features to the ride.panel
ride.panel <-
  ride.panel%>%
  left_join(st_drop_geometry(phl.ride) %>% distinct(start_station, .keep_all = TRUE) %>% select(start_station, ends_with(".nn")), by = c("start_station"="start_station"))

2.6 Create time lags

Creating time lag variables will add additional nuance about the demand during a given time period - hours before and during that day. To test temporal correlation, additional features engineering creates time lags for 1,2,3,4,12, and 24 hours.

#Create time lags
ride.panel <- 
  ride.panel %>% 
  arrange(start_station, interval60) %>% 
  mutate(lagHour = dplyr::lag(Trip_Count,1),
         lag2Hours = dplyr::lag(Trip_Count,2),
         lag3Hours = dplyr::lag(Trip_Count,3),
         lag4Hours = dplyr::lag(Trip_Count,4),
         lag12Hours = dplyr::lag(Trip_Count,12),
         lag1day = dplyr::lag(Trip_Count,24)) %>%
  mutate(day = yday(interval60))

2.7 Spliting training and testing

Two different training and testing datasets were developed: one is a 3 week training set and the other is a 2 week test set.

ride.Train <- filter(ride.panel, week >= 41 & week <= 43)
ride.Test <- filter(ride.panel, week >= 44 & week <= 45)

3. Exploratory Analysis

3.1 Trip_Count serial autocorrelation

Figure 1 shows the overall time pattern from October 8th to November 11st, 2019, where the vertical black line denotes Monday. There is quite similar pattern of each week, with consistent peaks and troughs. There is a slightly decrease of ridershare trips in week 44 and 45, the testing dataset.

mondays <- 
  mutate(ride.panel,
         monday = ifelse(dotw == "周一" & hour(interval60) == 1,
                         interval60, 0)) %>%
  filter(monday != 0) 

rbind(
  mutate(ride.Train, Legend = "Training"), 
  mutate(ride.Test, Legend = "Testing")) %>%
  group_by(Legend, interval60) %>% 
  summarize(Trip_Count = sum(Trip_Count)) %>%
  ungroup() %>% 
  ggplot(aes(interval60, Trip_Count, colour = Legend)) + geom_line() +
  geom_vline(data = mondays, aes(xintercept = monday)) +
  scale_colour_manual(values = palette2) +
  labs(title="Figure 1. Rideshare trips by week in Philly: October 8th - November 11st, 2019",
       x="Day", y="Trip Count") +
  plotTheme + theme(panel.grid.major = element_blank())

Table 1 shows the correlation between each time lag feature and the trip count. Overall there is postive correlation between time lag and the trip count but the relationship is not every strong. Typically as the hourly time lag decrease, the trip count decreases. It shows the variation of the ridershare trip in a single day.

plotData.lag <-
  as.data.frame(ride.panel)%>%
  dplyr::select(starts_with("lag"), Trip_Count) %>%
  gather(Variable, Value, -Trip_Count) %>%
  mutate(Variable = fct_relevel(Variable, "lagHour","lag2Hours","lag3Hours",
                                "lag4Hours","lag12Hours","lag1day"))
correlation.lag <-
  group_by(plotData.lag, Variable) %>%
  summarize(correlation = round(cor(Value, Trip_Count, use = "complete.obs"), 2)) %>%
  kable(caption = "Table 1. Ridershare trip count") %>%
  kable_styling("striped", full_width = F)

correlation.lag

Table 1. Ridershare trip count
Variable	correlation
lagHour	0.44
lag2Hours	0.32
lag3Hours	0.24
lag4Hours	0.18
lag12Hours	-0.03
lag1day	0.42

3.2 Trip_Count spatial autocorrelation

Ride share exhibits strong temporal correlation and now we will explore its spatial autocorrelation. Figure 2 maps the sum of ridershare trips by station and week respectively. The bikeshare trip count distribute unevenly across Philly. Most trips concentrate in the central city, with gradually decreasing amount from the center city to the edges of Philly. The ride share demonstrated significant spatial autocorrelation since its clustering pattern is spatially significant, and such spatial pattern is quite similar across the week.

ggplot()+
  geom_sf(data = phl.census)+
  geom_point(data = dat_census %>%
               group_by(start_station, start_lat, start_lon, week)%>%
               tally(),
             aes(x=start_lon, y = start_lat, color = n), 
             fill = "transparent", alpha = 0.8, size = 1)+
  scale_colour_viridis(direction = -1,
                       discrete = FALSE, option = "D")+
  ylim(min(dat_census$start_lat), max(dat_census$start_lat))+
  xlim(min(dat_census$start_lon), max(dat_census$start_lon))+
  facet_grid(~week)+
  labs(title="Figure 2. Sum of rideshare trips by station and week",
       subtitle = "Philadelphia, October 8th - November 11st, 2019")+
  mapTheme

3.3 Space/time correlation

Since ride share in Philadelphia exhibits very strong spatial and temporal autocorrelation, now we visualize them together. Figure 3 shows the distribution of trip volume by station for different times of the day. Overall, there is a few high volume period while most are low volume. Figure 4 confirms that the ridership data includes lots of low demand stations and very few high-demand stations.

dat_census %>%
  mutate(time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
  group_by(interval60, start_station, time_of_day) %>%
  tally()%>%
  group_by(start_station, time_of_day)%>%
  summarize(mean_trips = mean(n))%>%
  ggplot()+
  geom_histogram(aes(mean_trips), binwidth = 1)+
  labs(title="Figure 3. Mean Number of Hourly Trips Per Station",
       subtitle="Philadelphia, October 8th - November 11st, 2019",
       x="Number of trips", 
       y="Frequency")+
  facet_wrap(~time_of_day)+
  plotTheme

ggplot(dat_census %>%
         group_by(interval60, start_station) %>%
         tally())+
  geom_histogram(aes(n), binwidth = 5)+
  labs(title="Figure 4. Bike share trips per hr by station",
       subtitle = "Philadelphia, October 8th - November 11st, 2019",
       x="Trip Counts", 
       y="Number of Stations")+
  plotTheme

The daily trends in ridership by day of the week and weekend versus weekday is shown in figure 5 and 6. There is significant difference between weekday and weekend, which means that different re-balancing strategies should be considered for weekday and weekends. In workday’s afternoon, the demand of ridership reach the peakest that exceed the peak in the morning, so organizers should allocate more ridership during this time.

ggplot(dat_census %>% mutate(hour = hour(start_time)))+
  geom_freqpoly(aes(hour, color = dotw), binwidth = 1)+
  labs(title="Figure 5. Bike share trips in Philadelphia",
       subtitle = "October 8th - November 11st, 2019",
       x="Hour", 
       y="Trip Counts")+
  plotTheme

ggplot(dat_census %>% 
         mutate(hour = hour(start_time),
                weekend = ifelse(dotw %in% c("周日", "周六"), "Weekend", "Weekday")))+
  geom_freqpoly(aes(hour, color = weekend), binwidth = 1)+
  labs(title="Figure 6. Bike share trips in Philadelphia - weekend vs weekday",
       subtitle = "October 8th - November 11st, 2019",
       x="Hour", 
       y="Trip Counts")+
  plotTheme

Based on Figure 5 and 6, Figure 7 maps the hot spots of ridership during different periods in weekday or weekend. In workday, the hot spots for morning and afternoon is different, as there are more hot spots concentrating in center city while in the morning, the hot spots just spread around the central city. This also provide guidance for organizers on how to allocate the bike share in spatial aspect.

ggplot()+
  geom_sf(data = phl.census)+
  geom_point(data = dat_census %>%
               mutate(hour = hour(start_time),
                      weekend = ifelse(dotw %in% c("周日", "周六"), "Weekend", "Weekday"),
                      time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                              hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                              hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                              hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
               group_by(start_station, start_lat, start_lon, weekend, time_of_day) %>%
               tally(),
             aes(x=start_lon, y = start_lat, color = n), 
             fill = "transparent", alpha = 0.8, size = 1)+
  scale_colour_viridis(direction = -1,
                       discrete = FALSE, option = "D")+
  ylim(min(dat_census$start_lat), max(dat_census$start_lat))+
  xlim(min(dat_census$start_lon), max(dat_census$start_lon))+
  facet_grid(weekend~time_of_day)+
  labs(title="Figure 7. rideshare trips by station",
       subtitle = "Philadelphia, October 8th - November 11st, 2019")+
  mapTheme

A random day of this study period is picked to see how the ridership distribution changes in one day. An animated gif map of ride share trips over space and time is created as figure 8 shows, where an hour interval is used and the entire animation lasts for 20 seconds. There is a trend of moving from edges to the center in the morning, and moving back in the afternoon. Riderships are heavily used especially around 5-6pm in center city, while in the morning, the hotspots of ridership spread around the center city. This rule follows figure 5-7 above.

#create an animated map
week42 <-
  filter(ride2 , week == 42 & dotw == "周一")

week42.panel <-
  expand.grid(
    interval60 = unique(week42$interval60),
    start_station = unique(ride2$start_station))

station<-ride.panel%>%
  dplyr::select(start_station,start_lon,start_lat)%>%
  dplyr::distinct(start_station,start_lon,start_lat)%>%
  st_as_sf(coords = c("start_lon", "start_lat"), crs = 4236, agr = "constant")%>%
  mutate(start_lon = unlist(map(geometry, 1)),
         start_lat = unlist(map(geometry, 2)))
  
ride.animation.data <-
  mutate(week42, Trip_Counter = 1) %>%
  right_join(week42.panel) %>% 
  group_by(interval60, start_station) %>%
  summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>% 
  ungroup() %>% 
  left_join(station, by=c("start_station"="start_station")) %>%
  mutate(Trips = case_when(Trip_Count == 0 ~ "0 trips",
                           Trip_Count > 0 & Trip_Count <= 3 ~ "1-3 trips",
                           Trip_Count > 3 & Trip_Count <= 6 ~ "4-6 trips",
                           Trip_Count > 6 & Trip_Count <= 10 ~ "7-10 trips",
                           Trip_Count > 10 ~ "11+ trips")) %>%
  mutate(Trips  = fct_relevel(Trips, "0 trips","1-3 trips","4-6 trips",
                              "7-10 trips","10+ trips"))

rideshare_animation <-
  ggplot() +
  geom_sf(data=phl.census)+
  geom_point(data = ride.animation.data, aes(x=start_lon, y = start_lat, color = Trips, size=0.5)) +
  scale_fill_manual(values = palette5) +
  labs(title = "Figure 8. Rideshare pickups for one day in October 2019",
       subtitle = "60 minute intervals: {current_frame}") +
  transition_manual(interval60) +
  ylim(min(dat_census$start_lat), max(dat_census$start_lat))+
  xlim(min(dat_census$start_lon), max(dat_census$start_lon))+
  mapTheme

animate(rideshare_animation, duration=20, renderer = gifski_renderer())

4. Modeling and validation

4.1 train and validate several models to compare

Four different linear regression are built with different variables:
- reg1 focuses on just time, day of the week, and weather.
- reg2 focuses on time, day of the week, weather, distance to amenities service, space.
- reg3 focuses on time, day of the week, weather, distance to amenities service, space and demographic characteristics.
- reg4 focuses on time, day of the week, weather, distance to amenities service, space, demographic characteristics and time lag features.
The R square gradually increased from reg1 to reg4, which rised significantly when time lag features are introduced, with the Adjusted R-squared of 0.2938. The most significant factors include Percent_White, lagHour, bus_stops.nn, Percipitation , dotw and hour(interval60). But the R-square is still not really high so further adjustment of parameters is essential.

##reg1 focuses on just time, day of the week, and weather
reg1 <- lm(Trip_Count ~  hour(interval60) + dotw + Temperature+ Percipitation + Wind_Speed, data=ride.Train)
summary(reg1)

## 
## Call:
## lm(formula = Trip_Count ~ hour(interval60) + dotw + Temperature + 
##     Percipitation + Wind_Speed, data = ride.Train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.3219 -0.7389 -0.4374  0.2535 18.8729 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       0.0517764  0.0525021   0.986   0.3240    
## hour(interval60)  0.0326292  0.0008268  39.465  < 2e-16 ***
## dotw.L            0.1504102  0.0135014  11.140  < 2e-16 ***
## dotw.Q           -0.2810599  0.0137923 -20.378  < 2e-16 ***
## dotw.C            0.0326047  0.0138598   2.352   0.0187 *  
## dotw^4           -0.2209019  0.0133447 -16.554  < 2e-16 ***
## dotw^5            0.0804585  0.0133252   6.038 1.57e-09 ***
## dotw^6            0.0663899  0.0132691   5.003 5.65e-07 ***
## Temperature       0.0064028  0.0009735   6.577 4.82e-11 ***
## Percipitation    -0.3262121  0.0210675 -15.484  < 2e-16 ***
## Wind_Speed       -0.0067883  0.0010922  -6.215 5.16e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.325 on 71053 degrees of freedom
## Multiple R-squared:  0.04603,    Adjusted R-squared:  0.04589 
## F-statistic: 342.8 on 10 and 71053 DF,  p-value: < 2.2e-16

##reg2 focuses on time, day of the week, weather, distance to amenities service, space
reg2 <- lm(Trip_Count ~  hour(interval60) + dotw + Temperature+ Percipitation + Wind_Speed
           +bus_stops.nn+school.nn+picnic.nn+playground.nn, data=ride.Train)
summary(reg2)

## 
## Call:
## lm(formula = Trip_Count ~ hour(interval60) + dotw + Temperature + 
##     Percipitation + Wind_Speed + bus_stops.nn + school.nn + picnic.nn + 
##     playground.nn, data = ride.Train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5960 -0.7360 -0.4004  0.2412 18.8614 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -2.126e-04  5.471e-02  -0.004  0.99690    
## hour(interval60)  3.263e-02  8.219e-04  39.700  < 2e-16 ***
## dotw.L            1.504e-01  1.342e-02  11.206  < 2e-16 ***
## dotw.Q           -2.811e-01  1.371e-02 -20.499  < 2e-16 ***
## dotw.C            3.260e-02  1.378e-02   2.366  0.01796 *  
## dotw^4           -2.209e-01  1.327e-02 -16.652  < 2e-16 ***
## dotw^5            8.046e-02  1.325e-02   6.074 1.25e-09 ***
## dotw^6            6.639e-02  1.319e-02   5.033 4.84e-07 ***
## Temperature       6.403e-03  9.677e-04   6.616 3.71e-11 ***
## Percipitation    -3.262e-01  2.094e-02 -15.576  < 2e-16 ***
## Wind_Speed       -6.788e-03  1.086e-03  -6.252 4.08e-10 ***
## bus_stops.nn     -1.089e+02  1.033e+01 -10.538  < 2e-16 ***
## school.nn        -2.605e+01  2.320e+00 -11.227  < 2e-16 ***
## picnic.nn         2.733e+01  1.988e+00  13.748  < 2e-16 ***
## playground.nn     7.139e+00  2.648e+00   2.696  0.00703 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.317 on 71049 degrees of freedom
## Multiple R-squared:  0.05731,    Adjusted R-squared:  0.05713 
## F-statistic: 308.5 on 14 and 71049 DF,  p-value: < 2.2e-16

##reg3 focuses on time, day of the week, weather, distance to amenities service,space and demographic characteristics
reg3 <- lm(Trip_Count ~  hour(interval60) + dotw + Temperature+ Percipitation + Wind_Speed
           +bus_stops.nn+school.nn+picnic.nn+playground.nn
           +Percent_Taking_Public_Trans+Mean_Commute_Time+Percent_White+Med_Age+Med_Inc, data=ride.Train)
summary(reg3)

## 
## Call:
## lm(formula = Trip_Count ~ hour(interval60) + dotw + Temperature + 
##     Percipitation + Wind_Speed + bus_stops.nn + school.nn + picnic.nn + 
##     playground.nn + Percent_Taking_Public_Trans + Mean_Commute_Time + 
##     Percent_White + Med_Age + Med_Inc, data = ride.Train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6871 -0.7521 -0.3764  0.2697 18.6494 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 -3.790e-01  7.731e-02  -4.903 9.46e-07 ***
## hour(interval60)             3.336e-02  8.352e-04  39.940  < 2e-16 ***
## dotw.L                       1.539e-01  1.364e-02  11.281  < 2e-16 ***
## dotw.Q                      -2.897e-01  1.393e-02 -20.793  < 2e-16 ***
## dotw.C                       3.282e-02  1.400e-02   2.344  0.01909 *  
## dotw^4                      -2.275e-01  1.348e-02 -16.873  < 2e-16 ***
## dotw^5                       8.301e-02  1.346e-02   6.167 7.01e-10 ***
## dotw^6                       6.952e-02  1.340e-02   5.186 2.16e-07 ***
## Temperature                  6.669e-03  9.834e-04   6.781 1.20e-11 ***
## Percipitation               -3.333e-01  2.128e-02 -15.662  < 2e-16 ***
## Wind_Speed                  -6.940e-03  1.103e-03  -6.289 3.20e-10 ***
## bus_stops.nn                -9.540e+01  1.053e+01  -9.060  < 2e-16 ***
## school.nn                   -4.521e+00  3.289e+00  -1.375  0.16920    
## picnic.nn                    1.624e+00  2.313e+00   0.702  0.48253    
## playground.nn                1.825e+01  3.032e+00   6.018 1.77e-09 ***
## Percent_Taking_Public_Trans -3.848e-01  9.745e-02  -3.948 7.87e-05 ***
## Mean_Commute_Time            8.994e-04  1.734e-04   5.186 2.16e-07 ***
## Percent_White                7.067e-01  3.531e-02  20.017  < 2e-16 ***
## Med_Age                      5.375e-04  8.343e-04   0.644  0.51941    
## Med_Inc                     -7.197e-07  2.455e-07  -2.931  0.00337 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.319 on 69028 degrees of freedom
##   (因为不存在，2016个观察量被删除了)
## Multiple R-squared:  0.07287,    Adjusted R-squared:  0.07261 
## F-statistic: 285.5 on 19 and 69028 DF,  p-value: < 2.2e-16

##reg4 focuses on time, day of the week, weather, distance to amenities service, space, demographic characteristics and time lag features
reg4 <- lm(Trip_Count ~  hour(interval60) + dotw + Temperature+ Percipitation + Wind_Speed
           +bus_stops.nn+school.nn+picnic.nn+playground.nn
           +Percent_Taking_Public_Trans+Mean_Commute_Time+Percent_White+Med_Age+Med_Inc
           +lagHour+lag2Hours+lag3Hours+lag4Hours+lag12Hours+lag1day, data=ride.Train)
summary(reg4)

## 
## Call:
## lm(formula = Trip_Count ~ hour(interval60) + dotw + Temperature + 
##     Percipitation + Wind_Speed + bus_stops.nn + school.nn + picnic.nn + 
##     playground.nn + Percent_Taking_Public_Trans + Mean_Commute_Time + 
##     Percent_White + Med_Age + Med_Inc + lagHour + lag2Hours + 
##     lag3Hours + lag4Hours + lag12Hours + lag1day, data = ride.Train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.1335 -0.5162 -0.2214  0.2584 15.1390 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  1.201e-02  6.789e-02   0.177 0.859609    
## hour(interval60)             5.589e-03  7.711e-04   7.248 4.29e-13 ***
## dotw.L                       1.694e-02  1.198e-02   1.413 0.157521    
## dotw.Q                      -1.629e-01  1.225e-02 -13.300  < 2e-16 ***
## dotw.C                       4.958e-02  1.223e-02   4.055 5.02e-05 ***
## dotw^4                      -1.780e-01  1.183e-02 -15.038  < 2e-16 ***
## dotw^5                       1.139e-01  1.182e-02   9.635  < 2e-16 ***
## dotw^6                       4.313e-02  1.170e-02   3.685 0.000229 ***
## Temperature                  6.686e-04  8.666e-04   0.772 0.440392    
## Percipitation               -1.585e-01  1.872e-02  -8.465  < 2e-16 ***
## Wind_Speed                  -1.093e-03  9.675e-04  -1.130 0.258564    
## bus_stops.nn                -3.492e+01  9.208e+00  -3.792 0.000149 ***
## school.nn                   -1.566e+00  2.870e+00  -0.546 0.585297    
## picnic.nn                    6.032e-01  2.018e+00   0.299 0.765033    
## playground.nn                6.470e+00  2.650e+00   2.441 0.014639 *  
## Percent_Taking_Public_Trans -1.329e-01  8.507e-02  -1.562 0.118370    
## Mean_Commute_Time            3.152e-04  1.515e-04   2.081 0.037447 *  
## Percent_White                2.547e-01  3.112e-02   8.187 2.73e-16 ***
## Med_Age                      2.504e-04  7.288e-04   0.344 0.731139    
## Med_Inc                     -2.851e-07  2.143e-07  -1.330 0.183480    
## lagHour                      2.906e-01  3.765e-03  77.196  < 2e-16 ***
## lag2Hours                    7.789e-02  3.909e-03  19.928  < 2e-16 ***
## lag3Hours                    4.230e-02  3.909e-03  10.821  < 2e-16 ***
## lag4Hours                    9.782e-03  3.717e-03   2.632 0.008496 ** 
## lag12Hours                  -2.110e-02  3.305e-03  -6.385 1.73e-10 ***
## lag1day                      2.633e-01  3.481e-03  75.641  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.151 on 68998 degrees of freedom
##   (因为不存在，2040个观察量被删除了)
## Multiple R-squared:  0.2941, Adjusted R-squared:  0.2938 
## F-statistic:  1150 on 25 and 68998 DF,  p-value: < 2.2e-16

4.2 Predict for test data

After running the models, a nested data frame of test data is created by week and a function is built to predict each week.

ride.Test.weekNest <- 
  ride.Test %>%
  nest(-week)

model_pred <- function(dat, fit){
  pred <- predict(fit, newdata = dat)}

week_predictions <- 
  ride.Test.weekNest %>% 
  mutate(ATime_weather = map(.x = data, fit = reg1, .f = model_pred),
         BAmenity_Time_weather = map(.x = data, fit = reg2, .f = model_pred),
         CSpace_Amenity_Time_weather = map(.x = data, fit = reg3, .f = model_pred),
         DTimeLags_Space_Amenity_Time_weather = map(.x = data, fit = reg4, .f = model_pred)) %>% 
  gather(Regression, Prediction, -data, -week) %>%
  mutate(Observed = map(data, pull, Trip_Count),
         Absolute_Error = map2(Observed, Prediction,  ~ abs(.x - .y)),
         MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
         sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))

week_predictions

## # A tibble: 8 x 8
##    week data     Regression      Prediction Observed  Absolute_Error   MAE sd_AE
##   <dbl> <list>   <chr>           <list>     <list>    <list>         <dbl> <dbl>
## 1    44 <tibble~ ATime_weather   <dbl [23,~ <dbl [23~ <dbl [23,688]> 0.815 0.909
## 2    45 <tibble~ ATime_weather   <dbl [23,~ <dbl [23~ <dbl [23,688]> 0.784 0.860
## 3    44 <tibble~ BAmenity_Time_~ <dbl [23,~ <dbl [23~ <dbl [23,688]> 0.806 0.906
## 4    45 <tibble~ BAmenity_Time_~ <dbl [23,~ <dbl [23~ <dbl [23,688]> 0.776 0.856
## 5    44 <tibble~ CSpace_Amenity~ <dbl [23,~ <dbl [23~ <dbl [23,688]> 0.813 0.892
## 6    45 <tibble~ CSpace_Amenity~ <dbl [23,~ <dbl [23~ <dbl [23,688]> 0.784 0.856
## 7    44 <tibble~ DTimeLags_Spac~ <dbl [23,~ <dbl [23~ <dbl [23,688]> 0.669 0.796
## 8    45 <tibble~ DTimeLags_Spac~ <dbl [23,~ <dbl [23~ <dbl [23,688]> 0.651 0.760

4.3 Examine Error Metrics for Accuracy

The Mean Absolute Error (MAE) is calculated for each model and by week. Obviously, reg4 model performs best as its MAE is much lower than the others. This model is more accurate and more generalizable with a MAE value of around 0.6.

week_predictions %>%
  dplyr::select(week, Regression, MAE) %>%
  gather(Variable, MAE, -Regression, -week) %>%
  ggplot(aes(week, MAE)) + 
  geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
  scale_fill_manual(values = palette5) +
  scale_x_continuous(breaks = seq(44, 45, by = 1)) +
  labs(title = "Figure 9. Mean Absolute Errors by model specification and week") +
  plotTheme

To further examine the generalizability of these models, the prediction vs. observation of each model across time is ploted in Figure 10 based on the 2-week test dataset. Reg4 model shows the less difference between the observation and prediction, so we confirmed that the model, combining time, day of the week, weather, distance to amenities service, space, demographic characteristics and time lag features, has the best goodness of fit generally. But it does not predict accurately during some peak periods.

week_predictions %>% 
  mutate(interval60 = map(data, pull, interval60),
         from_station_id = map(data, pull, start_station)) %>%
  dplyr::select(interval60, from_station_id, Observed, Prediction, Regression) %>%
  unnest() %>%
  na.omit() %>%
  gather(Variable, Value, -Regression, -interval60, -from_station_id) %>%
  group_by(Regression, Variable, interval60) %>%
  summarize(Value = sum(Value)) %>%
  ggplot(aes(interval60, Value, colour=Variable)) + 
  geom_line(size = 1.1) + 
  facet_wrap(~Regression, ncol=1) +
  labs(title = "Figure 10. Predicted/Observed bike share time series", subtitle = "Philly; A test set of 2 weeks",  x = "Hour", y= "Station Trips") +
  plotTheme

Take a look at the mean absolute errors by station in Figure 11. Stations with the highest MAEs concentrated in center city, while lower stations with lower MAEs are located in the edges of Philly. Beside errors in spatial pattern, we will further analyze the temporal element of error.

week_predictions %>% 
  mutate(interval60 = map(data, pull, interval60),
         start_station = map(data, pull, start_station), 
         start_lat = map(data, pull, start_lat), 
         start_lon = map(data, pull, start_lon)) %>%
  select(interval60, start_station, start_lat, start_lon, Observed, Prediction, Regression) %>%
  unnest() %>%
  filter(Regression == "DTimeLags_Space_Amenity_Time_weather") %>%
  group_by(start_station, start_lat, start_lon) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  ggplot(.)+
  geom_sf(data = phl.census, color = "grey", fill = "transparent")+
  geom_point(aes(x = start_lon, y = start_lat, color = MAE), 
             fill = "transparent", alpha = 1)+
  scale_colour_viridis(direction = -1,
                       discrete = FALSE, option = "D")+
  ylim(min(dat_census$start_lat), max(dat_census$start_lat))+
  xlim(min(dat_census$start_lon), max(dat_census$start_lon))+
  labs(title="Figure 11. Mean Abs Error, Test Set, Model 4")+
  mapTheme

4.4 Space-Time Error Evaluation

Observed vs. predicted values are plotted for different times of day during the week and weekend, where red line represents prediction trend, while black line represents observation trend. The prediction is underestimated in every period.

week_predictions %>% 
  mutate(interval60 = map(data, pull, interval60),
         start_station = map(data, pull, start_station), 
         start_lat = map(data, pull, start_lat), 
         start_lon = map(data, pull, start_lon),
         dotw = map(data, pull, dotw)) %>%
  select(interval60, start_station, start_lat, start_lon, Observed, Prediction, Regression,
         dotw) %>%
  unnest() %>%
  filter(Regression == "DTimeLags_Space_Amenity_Time_weather")%>%
  mutate(weekend = ifelse(dotw %in% c("周日", "周六"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
  ggplot()+
  geom_point(aes(x= Observed, y = Prediction))+
  geom_smooth(aes(x= Observed, y= Prediction), method = "lm", se = FALSE, color = "red")+
  geom_abline(slope = 1, intercept = 0)+
  facet_grid(time_of_day~weekend)+
  labs(title="Figure 12. Observed vs Predicted",
       x="Observed trips", 
       y="Predicted trips")+
  plotTheme

By visualizing the MAE in spatial pattern by weekend/weekday and time of day in Figure 13, there are more errors in or around center city during rush hours in workdays, and during mid-day or PM rush in weekends. This spatial pattern follows Figure 11.

week_predictions %>% 
  mutate(interval60 = map(data, pull, interval60),
         start_station = map(data, pull, start_station), 
         start_lat = map(data, pull, start_lat), 
         start_lon = map(data, pull, start_lon),
         dotw = map(data, pull, dotw) ) %>%
  select(interval60, start_station, start_lat, start_lon,Observed, Prediction, Regression,
         dotw) %>%
  unnest() %>%
  filter(Regression == "DTimeLags_Space_Amenity_Time_weather")%>%
  mutate(weekend = ifelse(dotw %in% c("周日", "周六"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  group_by(start_station, weekend, time_of_day, start_lat, start_lon) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  ggplot(.)+
  geom_sf(data = phl.census, color = "grey", fill = "transparent")+
  geom_point(aes(x = start_lon, y = start_lat, color = MAE), 
             fill = "transparent", size = 1, alpha = 1)+
  scale_colour_viridis(direction = -1,
                       discrete = FALSE, option = "D")+
  ylim(min(dat_census$start_lat), max(dat_census$start_lat))+
  xlim(min(dat_census$start_lon), max(dat_census$start_lon))+
  facet_grid(weekend~time_of_day)+
  labs(title="Figure 13. Mean Absolute Errors, Test Set")+
  mapTheme

Now we focus on the morning commute, when users will commute from their homes to center city for work. As Figure 14 shows, a few stations with high income, low transit usage and high white people percentage, are resistant to model 4.

week_predictions %>% 
  mutate(interval60 = map(data, pull, interval60),
         start_station = map(data, pull, start_station), 
         start_lat = map(data, pull, start_lat), 
         start_lon = map(data, pull, start_lon),
         dotw = map(data, pull, dotw),
         Percent_Taking_Public_Trans = map(data, pull, Percent_Taking_Public_Trans),
         Med_Inc = map(data, pull, Med_Inc),
         Percent_White = map(data, pull, Percent_White)) %>%
  select(interval60, start_station, start_lat, 
         start_lon, Observed, Prediction, Regression,
         dotw, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
  unnest() %>%
  filter(Regression == "DTimeLags_Space_Amenity_Time_weather")%>%
  mutate(weekend = ifelse(dotw %in% c("周日", "周六"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  filter(time_of_day == "AM Rush") %>%
  group_by(start_station, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  gather(-start_station, -MAE, key = "variable", value = "value")%>%
  ggplot(.)+
  geom_point(aes(x = value, y = MAE), alpha = 0.4)+
  geom_smooth(aes(x = value, y = MAE), method = "lm", se= FALSE)+
  facet_wrap(~variable, scales = "free")+
  labs(title="Figure 14. Errors as a function of socio-economic variables",
       y="Mean Absolute Error (Trips)")+
  plotTheme

4.5 k-fold cross validation

To validate the result predicted by model 4, random k-fold cross validation is preformed on the 5 week panel by space. The MAEs are small, greater than 0 and its distribution is right-skewed rather than normality, which suggests that the generalizability and accuracy still need some improvement.

error_by_reg_and_fold %>%
  ggplot(aes(MAE)) + 
  geom_histogram(bins = 30, colour="black", fill = "#FDE725FF") +
  geom_vline(xintercept = 0) + scale_x_continuous(breaks = seq(0, 8, by = 1)) + 
  labs(title="Figure 15. Distribution of MAE", subtitle = "k-fold cross validation",
       x="Mean Absolute Error", y="Count") +
  plotTheme

4.6 Context of accuracy and generalizability

Based on the random k-fold cross validation results, Figure 16 and 17 shows respectively the generalizability by race and income context. MAEs distributed somewhat unevenly in Philly, suggesting that the generalizability and accuracy need further improvements. According to Figure 16, there is smaller MAE where majority of white occupy. Meanwhile, the MAE is higher at the edges of Philly, where the race diversity increased. From figure 17, there is higher MAE at the edge of Philly as well as low-income tracts. Center city has higher income where MAE decreases. However, in center city, MAE with extreme values tend to cluster. Indego ridership organizer should pay more attention to these area. In conlusion, ridership is vastly utilized in region with majority white and high income, and less applied in poor, diverse race area.

#race
  ggplot()+
    geom_sf(data = phl.census, aes(fill = Percent_White))+
    geom_point(data=error_by_reg_and_fold,aes(x = unlist(map(geometry, 1)),
                                              y = unlist(map(geometry, 2)),color=Mean_Error), alpha = 0.9, size=2)+
    scale_colour_viridis(direction = -1,
                         discrete = FALSE, option = "D")+
    ylim(min(dat_census$start_lat), max(dat_census$start_lat))+
    xlim(min(dat_census$start_lon), max(dat_census$start_lon))+
    labs(title="Figure 16. Generalizability by race context",
         subtitle = "random k-fold cross validation on the 5 week panel")+
    mapTheme

#income
  ggplot()+
    geom_sf(data = phl.census, aes(fill = Med_Inc))+
    geom_point(data=error_by_reg_and_fold,aes(x = unlist(map(geometry, 1)),
                                              y = unlist(map(geometry, 2)),color=Mean_Error), alpha = 0.9, size=2)+
    scale_colour_viridis(direction = -1,
                         discrete = FALSE, option = "D")+
    ylim(min(dat_census$start_lat), max(dat_census$start_lat))+
    xlim(min(dat_census$start_lon), max(dat_census$start_lon))+
    labs(title="Figure 17. Generalizability by income",
         subtitle = "random k-fold cross validation on the 5 week panel")+
    mapTheme

5. Conclusion

Based on time, day of the week, weather, distance to amenities service, space, demographic characteristics and time lag features, we built a prediction model for the bike share system in Phliadelphia, where time lag features become very strong predictors. Through Space-Time Error Evaluation, cross validation, test of accuracy and generalizability, the model predicts fairly accurate of ridership in Phlidelphia in terms of both space and time. But during peak times, the predicted results are lower than the observed, which means that more variables should be introduced in our model. Otherwise this might misled organizers to allocate less bikes at the stations and will affect people’s travelling pattern since they could not find or return a bike. Another limitation in our study is that we focus on the hourly interval, which might not be precise since the demand for ridership may change dynamically in an hour. Therefore, further research should consider finer time-scale, like 10 or 15 minutes interval to improve the accurancy of model. If these problems are solved, we will recommend such algorithm for ridership prediction for organizers of Indego.

Considering the spatial disparities of income and race, Indego ridership organizers should not only just focus on the center city, the highest-demanding region of ridership, but also should note the edge regions to make sure that the range of ridership can reach these areas, in order to optimize the re-balancing plan for the whole city. Aside from this prediction model, organizers should know clearly about the urban structure and the land use pattern around each station, take some other measurements that will help better allocating the bicycles, such as expanding stations rationally and promote the flow of public transportation, which ease the pressure people finding or returning bicycles.

References and sources:
1. Tian, Zihao, Jing Zhou, W.Y. Szeto, Lixin Tian, and Wenbin Zhang. 2020. “The Rebalancing Of Bike-Sharing System Under Flow-Type Task Window”. Transportation Research Part C: Emerging Technologies 112: 1-27. doi:10.1016/j.trc.2020.01.015.
2. https://www.rideindego.com/
3. https://www.opendataphilly.org/dataset