Emerging Data Sources Overview and Descriptions - Understanding and Using New Data Sources to Address Urban and Metropolitan Freight Challenges

This section of the Guide provides a brief overview of new and emerging sources of truck activity data, their potential application and fit with existing data, and some of the potential challenges to their integration and use.

While these emerging data create enormous analytical opportunities for those with access to them and the knowledge of how to extract value, working with these data can be challenging. The purpose of each overview in this section is to offer a snapshot of a data source, assess its utility, describe the severity of challenges related to its use, and link utility and value to the high-level strategic goals of practitioners and decision-makers.

While attempting to use new and emerging sources of data, decision-makers may encounter challenges in the following categories:

Policy Issues: regulatory environment, privacy, ownership
Institutional Issues: capacity, stewardship, equity
Technical Issues: completeness, accuracy, verifiability, dynamism, durability

We define each challenge area below. Our research also enabled an assessment of the severity of each challenge for each data source. These ratings are shown in sidebars using green, yellow, and red: green indicates that the data source is easy to use, yellow indicates some hindrance to its use, and red indicates that it is difficult to use.
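As a rough illustration of how such ratings could be recorded and summarized, the sketch below encodes the three colors and the challenge areas defined in this section. The `Severity` enum, the `CHALLENGES` mapping, the `worst_by_category` helper, and the sample ratings are all hypothetical; none of them is taken from the Guide's sidebars.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Sidebar colors, ordered so that a larger value means a harder challenge."""
    GREEN = 1   # easy to use
    YELLOW = 2  # some hindrance in use
    RED = 3     # difficult to use


# Challenge areas grouped by category, as defined in this section.
CHALLENGES = {
    "policy": ["regulatory environment", "ownership", "privacy"],
    "institutional": ["capacity", "stewardship", "equity"],
    "technical": ["completeness", "accuracy", "verifiability",
                  "dynamism", "durability"],
}


def worst_by_category(ratings):
    """Return the most severe rating found in each category.

    `ratings` maps a challenge name to a Severity. Challenges with no
    rating are skipped; a category with no ratings at all maps to None.
    """
    summary = {}
    for category, names in CHALLENGES.items():
        rated = [ratings[name] for name in names if name in ratings]
        summary[category] = max(rated) if rated else None
    return summary


# Placeholder ratings for an imaginary data source (illustrative only).
example_ratings = {
    "privacy": Severity.RED,
    "ownership": Severity.YELLOW,
    "capacity": Severity.YELLOW,
    "equity": Severity.GREEN,
}
example_summary = worst_by_category(example_ratings)
```

Summarizing by the worst rating in each category is only one possible design choice; an agency might instead weight the challenge areas according to its own strategic goals.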

Policy Issues

Regulatory environment refers to the laws, rules, standards, and regulations put into place by federal, state, or other government entities and civilian organizations that apply to the emerging data source. Do regulations facilitate or hinder data access and use? As an example, a Federal Communications Commission (FCC) ruling has made it illegal to obtain cell phone location information without the customer’s permission.

Ownership pertains to having legal title or the right to possess something. Ownership of data is tantamount to control; determining ownership defines who can collect, process, access, use, and disseminate data (Zmud et al., 2016). Ownership also determines who can profit from what is owned. Emerging data sources are increasingly privately collected, and there can be a competitive advantage in restricting access to the data. Is ownership tightly controlled by a few entities, or are the data easily made public or easy to purchase? As an example, GPS data are available to any entity that installs or has access to the requisite in-vehicle devices.

Privacy is defined as the capability of individuals to “determine for themselves when, how and to what extent information about them is communicated to others” (Westin, 1967). This is particularly relevant to the privacy of Personally Identifiable Information (PII), which is any data that could potentially identify a specific individual. A single piece of data can be PII, such as a Social Security number. Likewise, multiple pieces of data can be PII when merged, even when the individual pieces would not be. The policy question is: How easy is it to balance privacy and analytic specificity? As an example, a license plate number does not identify a specific person; rather, it identifies a vehicle. However, when linked with other information, such as home or work locations, the license plate number may be tied to an identifiable person and gain analytic specificity.

Institutional Issues

Capacity is defined as the ability to do something, whether that means having the necessary skills, personnel, or technical back-end systems. The amount of data captured by the source affects capacity because it places demands on processing and storage. How easy is it to work with the data source? Does it require special skills (e.g., specialization in data science, proprietary analysis methods), hiring of new staff, or investment in back-end systems? For example, as public agencies try to work with emerging data sources, they often lack the necessary experienced and skilled workforce (Tomer and Shivaram, 2017). The lack of skilled personnel can result from a limited fiscal capacity to hire data scientists, as public agencies struggle to compete with salaries offered by the private sector, or to provide adequate training in data science skills. The resulting experience and training gaps limit an agency’s ability to obtain data in a usable format, analyze it, and effectively deploy it in planning and decision making.

Stewardship refers to the design and application of data management principles covering collection, storage, retention, aggregation, de-identification, and procedures for data access, sharing, and use. The concept of a data steward is intended to convey a fiduciary (or trust) relationship with the individuals whose data are stored and managed by the steward. When not the collectors of raw data, public agencies are the stewards of data procured from third-party providers. Are the data entrusted to the public agency’s care, and if so, how easy is it for the agency to manage them carefully and responsibly? For example, the effective use of emerging data sets requires strong cross-agency collaboration, including the ability to scale costs across multiple budgets. This can run counter to the siloed functioning of many agencies related to the built environment.

Equity refers to the fairness with which transportation impacts (benefits and costs) are distributed. When applied to a source of data, equity is concerned with whether policy and planning decision making based on the data will be equitable across people, geographies, modes, etc. Are the data representative of the population, geography, etc., of the planning or policy decision? For example, GPS data are often limited to certain types of vehicles or classes of individuals.

Technical Issues

Completeness refers to whether there are gaps or holes in the available data; it does not necessarily refer to the volume of data. Such gaps might arise from sample limitations or from data being collected only at certain hours or only in daylight. Do data gaps or holes negatively impact use of the data for addressing urban or metropolitan challenges?
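One simple way to look for time-of-day gaps of this kind is sketched below. It is a minimal illustration only: the `missing_hours` function and the sample timestamps are hypothetical, and a real assessment would also need to consider sample and geographic coverage.

```python
from datetime import datetime


def missing_hours(timestamps):
    """Return the hours of the day (0-23) in which no record was observed."""
    observed = {ts.hour for ts in timestamps}
    return [hour for hour in range(24) if hour not in observed]


# Made-up observation times from a source that records only from 07:00 to 17:59.
sample = [datetime(2020, 1, 6, hour, 15) for hour in range(7, 18)]
gaps = missing_hours(sample)
```

Here `gaps` lists the overnight and evening hours with no observations, which is the kind of hole that could bias an analysis of off-peak truck activity.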

Accuracy is linked to the completeness of the data but also encompasses such things as the processes and methods used to collect the data and the level of precision that results. Do the data represent the true state of the patterns or elements being measured? As an example, location-based data from cell phones have a high degree of uncertainty because locations are derived from the phones’ pings to cell phone towers.

Verifiability is the degree to which the origins of the data and the manipulations that the data have undergone are documented. In the case of aggregated data, this means pointers to where the original data came from and documented processes and assumptions for how to reproduce analyses. Are the source data and associated manipulations consistent and verifiable? For example, data from computer vision systems are typically derived by applying automatic methods of data analysis, known as machine learning, to images in order to extract knowledge. One might have the extracted data (i.e., the knowledge) but not the source images.

Dynamism is defined as the lag between the data being captured and the data being accessible. Are the data updated or accessible on a continuous basis, or are they a snapshot taken at a given point in time that, once archived, will not be updated until the next snapshot? Here we are concerned with data recency: How long does it take, from the time of capture, to make data available for analysis?
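The lag described above can be measured directly when both the capture time and the availability time of each record are known. The sketch below is a minimal illustration under that assumption; the `availability_lags` function and the sample timestamp pairs are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import median


def availability_lags(records):
    """Return the lag (available_at - captured_at) for each record."""
    return [available - captured for captured, available in records]


# Made-up (captured_at, available_at) pairs for illustration only.
records = [
    (datetime(2020, 1, 6, 8, 0), datetime(2020, 1, 6, 8, 5)),
    (datetime(2020, 1, 6, 9, 0), datetime(2020, 1, 6, 9, 20)),
    (datetime(2020, 1, 6, 10, 0), datetime(2020, 1, 7, 10, 0)),
]
lags = availability_lags(records)
median_lag = median(lags)  # typical delay before data can be analyzed
max_lag = max(lags)        # worst-case delay in the sample
```

Reporting both a typical and a worst-case lag helps distinguish a near-continuous feed from one that is delivered in periodic snapshots.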

Durability refers to whether the data source will be supported or available over the long term. If so, it is likely that its accuracy or an agency’s capacity to use it could improve over time. Assessing the durability of a data source also gives a measure of how valuable the source could become, even if it is currently incomplete or limited in scope. Are the data from a source durable over time? For example, the Global Positioning System (GPS), the U.S. global navigation satellite system (GNSS), is under the auspices of the U.S. Department of Defense (DOD), which means that its operation is expected to be sustained for the long term.