ORCID Identifier(s)

0000-0001-6073-3896

Graduation Semester and Year

2022

Language

English

Document Type

Dissertation

Degree Name

Doctor of Philosophy in Civil Engineering

Department

Civil Engineering

First Advisor

Stephen P Dr

Second Advisor

Dr. Kate Kyung Hyun

Abstract

Despite increased awareness at the state and federal levels of the design and policy concerns associated with nonmotorized infrastructure, communities typically develop strategies to estimate network-wide nonmotorized traffic. However, past research fails to provide the same level of estimation sophistication as for motor vehicles due to a lack of investment, which limits increased understanding and capability among communities in network-wide nonmotorized traffic volume estimation. Although previous studies introduce machine learning, neural networks, and feature engineering to motorized traffic volume estimation, nonmotorized traffic estimation has seen limited attempts to use these advanced techniques. This dissertation addresses this deficiency in bicycle volume estimation for communities throughout the United States by using advanced algorithms and emerging GPS-based data to create sustainable deliverables for communities. This research addresses four main research questions: 1. What research gaps and future research scopes in nonmotorized bicycle volume estimation currently exist? 2. What benefits and challenges emerge from crowdsourced data and short-term counts to estimate network-wide bike volumes? 3. How can data fusion and modeling techniques such as different buffer sizes and variable selections improve the accuracy of network-wide bicycle volume prediction? 4. Can crowdsourced data serve as the sole source for estimating daily or annual average volumes without supplemental data? This dissertation addresses these research questions using a literature review, advanced modeling, and transportation knowledge. The data analysis uses Python and R. This research demonstrates new methods for transportation engineers and planners to use advanced techniques and emerging data and improve nonmotorized traffic count estimation. 
Insights from these improved estimations will create a more holistic image of transportation infrastructure planning and design in the USA and other countries of the world. Lessons from this study can be applied to many projects like sustainable green transportation projects, motorized vehicle projects, public health projects, and complete streets. Increasing commute and recreational cycling activities provides health benefits and motivates regional and local agencies to plan and build safe cycling infrastructure. An accurate bicycle volume estimation allows agencies to track and evaluate use rates and safety and public health impacts; this data may support the planning and design of the bicycle infrastructure by assessing the benefits of the infrastructure. In the past 25 years, several researchers adopted different modeling approaches and data sources to estimate bicycle volumes. The recent use of emerging crowdsourced data such as Strava, Streetlight, and bike share provides an opportunity to enhance modeling accuracy; however, the lack of a comprehensive review of the bicycle volume estimation techniques assessing the current research gaps in data and modeling makes determining the most effective and accurate strategies to estimate bicycle volumes challenging. This article provides a detailed review of 58 studies published after 1996 in peer-reviewed journals and conference proceedings. This work categorizes past research articles by their objectives, modeling techniques, data integration, model transferability, and performance; the investigation also provides an examination of emerging data. Based on the review, the study documents the current research gaps and recommends future research directions to improve data source evaluations, variable creation, modeling, and scalability/transferability advancements. 
Emerging sources of mobile location data such as Strava and other phone-based apps may provide useful information for assessing bicycle activity on each network link. Despite their potential to complement traditional bike count programs, the representativeness and suitability of these emerging sources for producing bicycle volume estimates remain unclear. This study investigates the emerging data challenges and opportunities by fusing Strava data with short-term and permanent conventional count program data to develop bicycle volume estimations using clustering and non-parametric modeling. The analysis indicates that the concentration of permanent counters at high bicycle volume locations presents a significant challenge for producing network-wide daily volume estimations, even though Strava data demonstrate potential in mitigating the estimation bias at lower-volume sites. Although Strava contributes to developing reliable and spatially and temporally transferable bicycle volume estimations, relying on Strava counts alone to characterize network-level activities remains a significant challenge due to sampling bias. This study will help planners discern and assess the challenges and opportunities of using emerging data in bicycle planning. Recent advancements in smartphone-based location data that collect and process large amounts of daily bicycle activities make using machine learning techniques for bicycle volume estimations possible and promising. Machine learning (ML) architecture has successfully characterized complex motorized volumes and travel patterns; however, nonmotorized traffic sees fewer ML efforts and typically relies on simple econometric models due to insufficient data for complex modeling. 
This study develops eight modeling techniques ranging from advanced techniques, such as Convolutional Neural Network (CNN), Deep Neural Network (DNN), Shallow Neural Network (SNN), Random Forest (RF), and XGBoost, to conventional and simpler approaches, such as Decision Tree (DT), Negative Binomial (NB), and Multiple Linear Regression, to estimate Daily Bicycle Traffic (DBT). This study uses 6,746 daily bicycle volumes collected from 178 permanent and short-term count locations from 2017 to 2019 in Portland, Oregon. The modeling considers a total of 45 independent variables capturing anonymous bicycle user activities (Strava count, bike share), built environments, motorized traffic, and sociodemographic characteristics to create comprehensive variable sets for predictive modeling. The modeling investigation also deploys two variable dimension reduction techniques, principal component analysis and random forest variable importance analysis, to ensure that the models are not over-generalized or over-fitted with a large variable set. The comparative study between models shows that the SNN and DNN machine learning techniques accurately estimate daily bicycle volumes. The results show that the DNN models predict the DBT with a maximum mean absolute percentage error (MAPE) of 22%, while the conventional model (linear regression) shows a MAPE of 45%. One of the most common modeling forms to estimate Average Annual Daily Bicycle Traffic (AADBT) is a direct demand model (DDM) that uses demographic, network, and traffic characteristics as explanatory variables. The performance of the DDM is subject to variable preparation and collection methods, and researchers commonly apply a buffer to aggregate and represent bicyclists and bicycle link characteristics. The majority of previous studies use a GIS tool to extract the variables at different buffer levels to identify the optimal buffer sizes and types that work best for their study area and count data. 
To overcome the time-consuming and labor-intensive effort of variable extraction, this study develops and tests a wide range of variables using various buffer types (Euclidean and Network) and sizes and compares their modeling performances. This study uses emerging count data (Strava, StreetLight) with contextual variables to develop Poisson regression models, where OpenStreetMap (OSM) data plays a key role in standardizing the network data collection. The research develops models for six different geographies (Portland, Eugene, Bend, Boulder, Charlotte, and Dallas) as city-specific models as well as a generalized model that integrates all the data from the six cities. The results indicate that combined Euclidean and Network buffers of 0.5 miles perform best for the generalized model with a goodness of fit of 0.75, while Euclidean and Network buffers individually have similar performance for 0.5-mile buffer sizes. The city-specific models indicate that local characteristics of the geography influence the buffer size and types. However, none of the cities except Boulder requires more than a 0.5-mile buffer to obtain the best model. Although this study demonstrates that the city-specific model does not require buffer sizes of more than 0.5 miles, more geographic areas may provide better information to classify the city and recommend the particular buffer types and sizes needed to extract the data and get the best estimations of AADBT from DDMs. Future research should consider a wide range of geographic areas and investigate generalized rules to set appropriate buffer types and sizes. Future research should also randomly choose the number of cities to investigate the effects of buffer size and type selection in generalized models. Moreover, this study considered a 60 m radius tube along the road centerline for a Network buffer, which can be adjusted to understand the impacts of the radius on capturing land use characteristics within a Network buffer. 
Currently, researchers and agencies use different techniques to aggregate the data for DDM without knowing the best steps to reduce error, because no policies or systematic guidelines exist for agencies to adopt for variable aggregation. The findings of this study provide guidelines for an agency to consider when specifying variables using buffers. Models combining multiple geographic areas likely require larger buffers to create generalized findings. This study recommends considering both the nature of the independent variables (e.g., land use, socio-demographic, and network) and data collection locations when selecting buffer types, and study area density when selecting buffer size.
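
As a rough illustration of the buffer-based variable aggregation discussed in the abstract, the following minimal sketch sums a point attribute inside a circular (Euclidean) buffer around a count site at two buffer sizes. It is not the dissertation's implementation: the function name `euclidean_buffer_sum`, the coordinates, and the population values are hypothetical, and the sketch assumes a projected coordinate system measured in meters. A Network buffer would instead follow the street network and is not shown here.

```python
import math

MILE_IN_METERS = 1609.344


def euclidean_buffer_sum(site, features, radius_miles=0.5):
    """Sum feature values whose points fall within a circular buffer of the site.

    site: (x, y) in a projected coordinate system with meter units (assumed).
    features: iterable of (x, y, value) tuples in the same coordinate system.
    """
    radius_m = radius_miles * MILE_IN_METERS
    total = 0.0
    for x, y, value in features:
        # Straight-line distance from the count site to the feature point.
        if math.hypot(x - site[0], y - site[1]) <= radius_m:
            total += value
    return total


# Hypothetical data: one count site and three points given as (x, y, population).
count_site = (0.0, 0.0)
population_points = [
    (100.0, 200.0, 350.0),    # about 224 m away: inside both buffers
    (900.0, 0.0, 500.0),      # 900 m away: outside 0.5 mi, inside 1 mi
    (3000.0, 3000.0, 800.0),  # about 4,243 m away: outside both buffers
]

half_mile_population = euclidean_buffer_sum(count_site, population_points, 0.5)
one_mile_population = euclidean_buffer_sum(count_site, population_points, 1.0)
print(half_mile_population, one_mile_population)
```

With these made-up points, the aggregated value changes with the buffer size, which is why the choice of buffer size can affect the explanatory variables fed into a direct demand model.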

Keywords

Bicycle, Volume, Daily bike volume, Network, Review, AADBT, Data, Model, Direct demand model, Performance, Strava, Streetlight, Bike share, Treed regression, Deep neural network, Shallow neural network, Machine learning

Disciplines

Civil and Environmental Engineering | Civil Engineering | Engineering

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Comments

Degree granted by The University of Texas at Arlington

Recommended Citation

Miah, Md. Mintu, "Data Fusion for Non-motorized Volume Estimation: A Machine Learning Approach" (2022). Civil Engineering Dissertations. 155.
https://mavmatrix.uta.edu/civilengineering_dissertations/155

Civil Engineering Dissertations

Data Fusion for Non-motorized Volume Estimation: A Machine Learning Approach