1 Introduction

As migration into urban areas continues unabated (United Nations 2019), policy makers are forced to navigate new challenges of sustainability and risk management in the midst of increased emissions and erratic climatic events (Barbour et al. 2019; Reinhart and Cerezo Davila 2016; Diffenbaugh 2020). Adhering to sustainability measures associated with updating existing buildings to new green standards, understanding the movements of people and their use of buildings, and managing the risk to population associated with natural or man-made events have resulted in more research into building occupancy at various spatiotemporal scales for green building technologies, risk analysis, and population modeling (Stewart et al. 2016).

Research on energy consumption reveals that 40% of the energy consumed in the United States in 2018 went to residential and commercial buildings (U.S. Energy Information Administration 2020). This research continues by capturing the building dynamics for optimization of building energy consumption, from exploiting sensor and digital traces within buildings for indoor positioning and mapping (Hossain 2019), or quantitative methods for modeling occupant behavior (Hong et al. 2020), to modeling transportation movements within a city to estimate population movement to transform an existing database of building occupancies (Berres et al. 2019).

Exploration into better methods to report where people may be expected in the built environment has led to advances in remote sensing technologies and machine learning for extraction of global settlement layers (Palacious-Lopez et al. 2019) to focus population modeling in the built environment. Building extractions have facilitated the development of building footprints (Yuan et al. 2018) and, coupled with census and survey data, supported the modeling of populations at the residential building level (Weber et al. 2018; Fecht et al. 2020). While the population is modeled at the building level, there remains opportunity to focus research on local building dynamics to capture visitors or transient populations. Social media and mobile phone data have been harnessed to capture population dynamics at higher temporal scales through Points of Interest (POI) and the open and close times of businesses (Sparks et al. 2020). Facebook and Google popularity curves have also been used in the development of 24-h building occupancies (Lu et al. 2020).

The seismic community also continues to perform research into hazards and risk involving exposure datasets, including building occupancies for development of probabilistic risk models at various spatial scales (Silva et al. 2018; Yepes-Estrada et al. 2017; Silva et al. 2020). This is accomplished through the collaborative support of the worldwide Global Earthquake Model (GEM 2020) and Pager (USGS 2020) programs. The World Housing Encyclopedia (WHE 2014) is an open community website that reports various building types, occupancies, and construction for the seismically active areas of the world to support more earthquake-resistant building practices. This information is developed through a collection of sources and harmonization techniques of expertise and open source data (Jaiswal et al. 2011; WHE 2014).

Oak Ridge National Laboratory developed the web-based Population Density Tables (PDT) global learning framework to report national and subnational building occupancies supported by a Bayesian framework (Stewart et al. 2016). The PDT portal is not openly accessible, but available to the US Federal government through account requests as required by this project. The occupancy estimates are reported as people/1000 ft2 for over 50 facility types (Fig. 1a) for night, day, and episodic (larger gatherings of people; stadiums—soccer matches; churches—sermons). Baseline statistical occupancy models are developed at the national or subnational level using expert knowledge, experience, or open source data and are updated by open source observation (Stewart et al. 2016).

Stewart et al. (2016) discusses the Bayesian statistical modeling process in depth and how the Bayesian learning process easily handles disparate open source data, surveys, expert knowledge, and experience. The use of observation models to report ambient occupancies was proven mathematically acceptable by Morton (2013). These models update and refine baseline occupancy estimates (priors) by dynamically updating reported building occupancy estimates (posteriors).

In developing a building occupancy learning system, challenges were encountered when scaling this research to the world: (1) the diversity of building types and geographic divisions, which requires an array of observation models to capture local socio-cultural activity through open source data and the development of proxies for data-poor areas; (2) reporting complete transparency and data provenance into the observation and the variability and uncertainty about the reported occupancies; and finally, (3)determining how to interpret the reported occupancies and uncertainty for spatial application.

2 PDT observations

The PDT system currently holds over 50 K observations for different geographic and spatial scales. A systematic process for the development of observation models using open source data was operationalized where multiple users simultaneously support PDT learning updates. This robust approach, which includes a review cycle prior to publication into the PDT system, establishes a level of consistency and accuracy in data input and a reduction in development error and results in transparency in occupancy reporting.

Figure 2 lays out the systematic process to developing observations. These models use authoritative open source data including academic journals, official government statistics, corporate and university webpages, tourism brochures, and surveys. Relevant observable data about the facility are extracted, and an observation model is selected. Uncertainty is reported through ranges (i.e., almost 300 employees or 1–3 floors) and surrogates developed for data variables not reported but required. The model is submitted for review, and once approved a facility observation PDF is developed. Finally, the priors are updated to posteriors by the resulting facility PDF within the Bayesian framework.

Fig. 1
figure 1

a PDT facility types captured under seven land use categories. b The Employee observation model supports all facility types under the green captioned land uses; Commercial, Retail and Service Outlets, Recreation/Entertainment, and Institutions. Hotels, Hospitals, Schools, Prisons and Cemeteries are represented by a different set of observation models, but fall under the same land use

Fig. 2
figure 2

PDT observation model workflow from open source to data measurement under uncertainty. PDF = Probability Distribution Functions

Ambient building occupancy estimates are reported in PDT as people/1000 ft2. Unfortunately, a query of open source statistics reveals data reported in precisely these units are indeed rare. Fortunately, there is a wealth of information from which one can infer these density units, given the right observation model. Observation models are simply equations or algorithms that allow one to estimate ambient density from a limited but sufficient set of open source observables. Values like total number of visitors, number of employees, job shifts, average visit durations, and building square footages are sufficient to estimate ambient occupancy and bypass intractable methods such as direct and constant surveillance of an operating facility.

Creation of observations is itself a necessary but insufficient condition. While Morton (2013) shows strict mathematical equivalence given strict input values, we often have uncertain observable values. Indeed, we sometimes find ourselves in a pseudo quantitative description such as “the museum had almost 1 million visitors last year” or “most visitors stay 2–3 h”. Furthermore, sometimes building characteristics are unclear. For example, in near NADIR satellite imagery, we may only be able to discern that a building has 1–3 stories. This ambiguity from source data can be used to model reasonable uncertainty. For example, one may choose a heavily left right Beta distribution on 0–1,000,000 m for the visitors, a family of log-normal distributions with modes between 2–3 for visit duration, and a discrete, equiprobable distribution over 1–3 for building height. We can measure the sensitivity and impact of different choices of posteriors to these choices. Treating observable inputs as random variables means that the observation models are indeed random functions, and we arrive at a means for representing (albeit imperfectly) uncertainty in the measured density observation. Using the methods outlined in Stewart et al. (2016), those data then move the prior to the posterior while considering the uncertainty in both.

What happens when a facility is missing an observable? For example, we may lack information about visit duration for a particular museum. As an engineering solution, PDT turns to default distribution model observables across a range of similar facility types. These distributions can fit using similar Bayesian learning principles or frequentist fits, but they will result in larger variances. This increases overall uncertainty in the observation density but allows PDT to move forward using only partial observables.

It is worth noting that uncertainty introduced by temporal lags has not been explicitly considered. If a facility’s observable data are 2 years old, how should that be weighed against a similar facility’s data that are only 2 weeks old? Presently, this remains an open question handled in an ad hoc approach based on subject matter expertise about when old is too old. Alternatively, we can assess the temporal stationarity of the observable by testing the hypothesis that the older data point comes from the same distribution as newer observables. In the main, the use of observation models and surrogate distribution for missing observables allows PDT to move forward with salient inputs for advancing priors to posteriors in the Bayesian learning framework.

Development of observation models has been largely based on informing encounters with open source content. At present there are close to 50 distinct observation models that can be grouped into classes handling a typical set of open source data streams. Often these mirror facility types, which may report similarly, but not reliably. Figure 3 lists the model classes, which mostly align with a single facility type, although two, Reported and Employee, contribute to multiple facility types. In fact, the Employee observation model supports more facility types than other models as is shown in Fig. 1b. We now describe these classes and point to exemplar models found within them.

Fig. 3
figure 3

Open source observation model classes in blue at the head of the list and models below. A few models report alternative approaches to accommodate available data

2.1 Reported

The Reported class is the simplest form of occupancy from known numbers of people reported within a certain space. This reporting can represent a certain time frame (hour or hours) of normal occurrence of occupancy, episodic or a large gathering of people for a function, or the occupancy capacity of the space.

$$pdt=1000\cdot {A}^{-1}\left[p\right].$$

The pdt model reports people/1000 ft2, and A is the area of the building. The population, \(p\), is simply divided by the area to report occupancy. The area may or may not be reported, but if a specific building and location is known, the area of the building can be digitized. The number of floors of the building can be determined through imagery or photos, and any uncertainty about that number is captured as an input range. This Simple model supports all building types within PDT, whereas Episodic and Capacity models are not as widespread.

2.2 Residential

Census and housing surveys are considered the best data available for capturing residential occupancy. However, the census or survey data collected in their original form are not available for direct use, and the reported results are masked (e.g., interval data, truncated), people and area are provided separately (e.g., average household size or average dwelling size), or are a combination of masked and averages. RevengC (Duchscherer et al. 2018) was developed to reverse engineer the various combinations of censored tables or averages through use of statistical algorithms such as Iterative Proportional Fitting, Maximum Likelihood, Minimum Chi-Square, and Weighted Least Squares. Supplementary inputs of lower and upper bounds of household size and dwelling size area are required in addition to the table values or averages and are subsequently published within PDT to produce probable estimates of people/1000 ft2 without uncertainty.

For areas of the world where there is a lack of recent census or surveys, the Residential Land Use (Fig. 1a) building functions are reported through the Housing model. This model accounts for the percentage of people working and who are of school age within a household to more accurately represent the number of people expected at home during the day.

$$pdt=1000\cdot {A}^{-1}\left[\left(\left({c}^{0-4}\cdot h\right)+{c}^{65+}\cdot h\right)+\left(\left(1-p\right){c}^{5-9}\cdot h\right)+\left(\left(1-s\right){c}^{10-14}\cdot h\right)+\left(\left(h\cdot {d}^{15-64}\right)u\right)\right].$$

To account for people at school and work at the household level, this model uses household size and percent school attendance. It is important that age groups align with primary and secondary school attendance. However, exceptions have been made where this alignment does not occur. For the model shown, the breakdown of age groups results in \({c}^{0-4}\), \({c}^{5-9}\), \({c}^{10-14}{c}^{15-64},\) and \({c}^{65+}\). Therefore, \(a\) is multiplied by \(h,\) which is the household size to determine the outcome of the number of children from 0–4 of a household at home diurnally. Again, \(h\) is multiplied by \(e,\) which is the number of people in the retired age group (65+) in a household at home during the day. The school age children and working population are handled differently. The number of children not in school, \(p\), which represents the percent primary school attendance, is subtracted from 1, and the result is multiplied by the household size. This is performed again for the secondary age school children to account for their percentage of the household size at home. For the working age population, \(d\) is multiplied by \(u,\) which is the unemployment for the geographic area.

Finally, refugee and internally displaced people (IDP) settlements follow the Housing model. Data collected for that particular settlement is collected from various humanitarian organizations that report on populations displaced from their homes due to conflicts or disasters.

2.3 School

Schools and universities are modeled to account for how many students, faculty, and staff will be at the facility at any time during the school hours. The types of data found vary in detail and describe demographic information on students, faculty and staff, school location, and floor and height area and its state of use. In most cases, only enrollment is reported, and efforts to verify the location are required. The model below captures the data necessary to develop the observation model, with some exceptions, such as using proxies as placeholders in the observation model until better data is secured.

$$pdt=1000 \cdot {A}^{-1} \left[\frac{(\left(e\cdot p\right)\cdot \overline{d })+\left(f\cdot \overline{j })+\left(s\cdot \overline{t } \right)\right.}{z}\right].$$

The student enrollment is represented by \(e\), which is multiplied by the percent attendance, \(p\). Many schools report enrollment, but the actual attendance is sometimes quite different. The result of \(e\cdot p\) is further informed by the average number of students, \(\overline{d }\), who may not be attending school due to sickness, appointments, or other errands, for example. This process is also performed for faculty, \(f\), and staff, \(s\), informed by average faculty \(\overline{j }\) and average staff \(\overline{t }\), respectively. Last of all, \(z\) accounts for whether the school operates in shifts.

Universities are represented by the same student model and enrollment, faculty and staff, excluding shifts and percent attendance. The area of the school is either reported, or the buildings that support administrative functions, learning, and research are digitized. On-campus housing is handled separately through the Residential model for Multi Family.

$$pdt= 1000 \cdot {A}^{-1}\left[\left(e\cdot \overline{d }\right)+\left(f\cdot \overline{j })+\left(s\cdot \overline{t } \right)\right.\right].$$

The university student enrollment is also represented by \(e\), and in this case, \(\overline{d }\) represents the average attendance without accounting for the percent attendance.

2.4 Employee

The employee model is a step beyond the simple model and accounts for employees reported for a business where activity within the building is performed by staff. This is the most widely used model across Institutions, Commercial, Retail and Service Outlets, and Recreation/Entertainment land use categories and associated building types (Fig. 1a and b).

$$pdt=1000\cdot {A}^{-1}\left[w \cdot \overline{w }\right].$$

The employees are represented by \(w,\) and the average number of employees by \(\overline{w }\).

The visitor-employee model was discussed in detail in Morton (2013) and Stewart et al. (2016) and is applied here to a wider range of service activities that fall under multiple building type activities (Fig. 1b) The annual visitors are \(a\), average visit time is \(\overline{v }\), number of days open each year is \(d\), and \(\overline{h }\) is the average number of hours the museum is open. The number of employees is \(w\), and \(\overline{p }\) is the average number of employees at work.

$$pdt=1000\cdot {A}^{-1}\left[\frac{a\cdot \overline{v} }{d\cdot \overline{h} }+w\cdot \overline{p }\right].$$

There are variations on this model based on self-reporting by businesses such as the number of visitors reported either daily, weekly, monthly, or annually. The visitors and employees drive the majority of activity spaces captured under Retail and Service Outlets, Institutions, and Recreation/Entertainment. There are some exceptions such as Hotel or Schools where there are specific observables readily available that fine tune the model to better define the ambient activity, e.g. hotel occuapancy.

The Shift Worker model accounts for multiple worker shifts whether in manufacturing or a public place such as a restaurant.

Number of shifts = 1

$$pdt=1000 \cdot {A}^{-1} \left[w \cdot \overline{p }\right].$$

Using manufacturing as an example, this model allows for additional worker schedule options from the Employee model via the input argument for the average number of workers, \(\overline{p }\). The user identifies the average number of workers for one shift, and the majority of the workers are modeled during the day and few people at night for security or cleaning crew. For two shifts, a higher average number of workers is expected during the day because of administrative and other services only offered during the daytime, but a certain number of workers are required to continue the night time manufacturing process.

Number of shifts = 2

$$pdt=1000 \cdot {A}^{-1} \left[w \cdot \overline{q }\right].$$

The 2 shifts model requires a different average number of workers, represented by \(\overline{q }\).

2.5 Cemetery

The Cemetery activity is modeled as open space rather than building space and is represented by activities including burials, graveside visits, memorial services, walking paths, tourists, and even IDP or refugee camps. Because there is limited information reported on cemetery visitation, the Visitation model accounts for graveside visits through the function of the number of graves in the cemetery. This count was originally performed manually, which is a time-consuming and error prone task with potential inconsistencies in counts due to human error without acknowledging uncertainty in the results. As a result, we leveraged existing capabilities in deep learning to support automated grave counting to report uncertainty (Lunga et al. 2020).

$$pdt=1000 \cdot {A}^{-1}\left[\frac{\left(g\cdot \overline{g }\right)\cdot \overline{h} }{\overline{v} }\right].$$

The area, \({A}^{-1}\), for the Visitation model is different from the previously discussed models because it is the extent of the open space, or cemetery itself. The number of graves, \(g\), is multiplied by the average number of graves visited, \(\overline{g },\) and visitation is average hours visit time, \(\overline{h }\), and average hours open, \(\overline{v }\).

The Episodic model supports burial activities or memorials where larger crowds are expected. The local socio-cultural activities surrounding the burial practice dictate where a burial will occur, the length of ceremony, and the number of visitors or group size at the burial ceremony. In this model, household size represents the basis for how many people may be attending a funeral for family, with group size for the extended family and additional people from the village included if local funerary practice includes community attendance.

$$pdt=1000 \cdot {A}^{-1}\left[\left(j\cdot \overline{k }\right)+l\right],$$

where the household size is \(j\), average family attendance is \(\overline{k }\), and group size is \(l\).

2.6 Prison

The Prison observation class of models captures either the baseline static prison population; prisoners and employees; or prisoners, employees, and visitors based on prison operations and data availability. Prison operations differ around the world, but generally the baseline static inmate population represents the vast majority of the population, with small variances due to employees and visitors.

$$pdt=1000\cdot {A}^{-1}\left[u+\left(w\cdot \overline{p }\right)+\left(v \cdot \overline{V }\right)\right].$$

The Prisoner model accounts for the number of prisoners, \(u\), while the employees are once again represented by \(w\), and \(\overline{p }\) is the average number of employees. The visitors, \(v\), represent the number reported and the average visitors, \(\overline{V }\).

The Capacity model requires the prison occupancy rate in lieu of reported prisoner populations.

$$pdt=1000\cdot {A}^{-1}\left[\left(c\cdot \overline{O }\right)+\left(w\cdot \overline{p }\right)\right],$$

where \(c\) is the capacity of the prison or cell, and \(\overline{O }\) is the average occupancy to produce the capacity prisoner population. The employees, \(w,\) and average employees, \(\overline{p },\) represent the same employee situation as in the Prison model.

2.7 Hotel

The observation models for hotels uniquely account for the area where people are expected to sleep and excludes shops, conference centers, or other similar venues. Another unique factor is the use of social media to inform whether the hotel is primarily occupied by business people or vacationers, as we expect lower occupancy in primarily business-oriented accommodations to account for fewer people under the business scenario and families under a visit for family fun scenario or a mixture of both. There are four scenarios for determining hotel area, \(A\). (1) If room area is not available, the hotel footprint is digitized, excluding the other services within the hotel, and multiplied by the number of floors. (2) The next best option is the total number of rooms multiplied by a range of guestroom areas if the number of guestrooms for that particular area are not reported. (3) The total number of rooms is multiplied by a range of guestroom and suite areas separately (this is a slightly better option than 2). (4) And last of all, and the best option, the number of rooms and suites and their specific area are reported.

$$pdt=1000\cdot {A}^{-1}\left[\left(\left(t\cdot g\right)\cdot \overline{O }\right)\cdot \overline{p }+\left(\left(t\cdot e\right)\cdot \overline{O }\right)\cdot \overline{q }\right].$$

The Occupancy model area, \(A\), of the guestrooms and suites is found using one of the four scenarios outlined above. The first half of the equation accounts for the number of guests within the hotel, and the second half, the number of employees, which we estimate based on the hotel star rating. The total number of guestrooms (and suites) is \(t\) and guests per room is \(g\), the average occupancy rate for the hotel is \(\overline{O },\) and the average guests is \(\overline{p }\). And again, the total number of guestrooms is \(t\), and this time, the number of employees per room is \(e\), the occupancy rate is \(\overline{O }\), and the average number of employees is \(\overline{q }\).

$$e=m \cdot r.$$

The number of employees per room, e, is defined by the star rating of the hotel, \(r\), and the number of rooms, \(m\). The star rating is based on the World Tourist Organization’s report that is referenced in the city-of-hotels website about the number of employees necessary to support the hotels (https://www.city-of-hotels.com/165/hotel-staff-en.html).

2.8 Port

Ports encompass a variety of activities including freight or oil movement, cruise line terminals, and ferry services, which may all occur within one port.

$$pdt=1000\cdot {A}^{-1}\frac{\left(s+p\right)\cdot \overline{p}}{z } .$$

The area, \(A\), is handled differently based on the function of the port. Where freight or oil is involved, the entire area of the port (open areas and buildings) is included. In this model, employees for ships, \(s\), are assessed by the number of docks and possible types of ships delivering goods. Both the port employees, \(p\), and the average number of employees, \(\overline{p },\) at the port are included. Shifts, \(z\), are a common occurrence at ports, but they may differ based on activity.

$$pdt=1000\cdot {A}^{-1}\left(s+(p\cdot \overline{p }\right)+\overline{n }+\frac{\overline{v} }{\overline{h} } .$$

In ports that primarily handle passenger ships (ferry or cruise), the area, \(A\), of the terminal is measured for embarking and disembarking activity only. Passengers for the cruise ship or ferry, \(n\), can be accounted for daily, weekly, monthly, or annually. If daily, \(s\) and \(p\) account for ship and port employees, respectively, the average number of employees, \(\overline{p }\), the average number of hours spent waiting for the ferry or ship, \(v\), and average hours open during the day, \(\overline{h }\).

3 Results

A systematic approach to developing observations optimizes the process of accommodating new disparate open source data to update existing models and guides analysts through the development of the models even as new updates are unveiled. This approach minimizes human errors in the development of the models, establishes the input data necessary for the model, and permits analysts to develop model updates to accommodate new data.

The analyst is guided through the development of models with requirements for discussion about the use of input data and sources within the development process, promoting transparency. Because the analyst is guided through the process consistently, they easily develop observations for a building function and geographic area, but also optimize their time in data collection through identification of the necessary data inputs. This approach encourages continuous review of existing models as new or updated data are discovered, or as opportunities to refine the existing process to accommodate new and better methods emerge. Each part of the imputed model is available within the system for users to interrogate.

Figure 4 consists of three panes of reported occupancies and supporting data from the PDT portal. The left pane lists the country building functions and the reported probability estimates by percentile for night and day: 10th, 50th and 90th. Shown in the middle pane is a list of observations published in the PDT system for Philippine hospitals and their night and day occupancies. The right pane shows the uncertainty for the Philippine hospitals, from the uncertainty captured at the data input level and propagated throughout the Bayesian modeling process to reveal the full spectrum of probable building occupancy estimates for that country. Taken together, we can clearly see and interrogate the variability in building occupancies reported for Philippine hospitals and the open source observations.

Fig. 4
figure 4

National ambient building occupancies reported in the left panel for 10th, 50th, and 90th probabilities for day and night. Observation models reporting ambient hospital occupancies for day and night in the middle, and the probability distribution for all observation models after publication into the PDT learning system on the right

For the 50th percentile in the right pane of Fig. 4, the uncertainty range is reported below the probability distributions in the green box (day) as 3.39–3.81 and in the blue box (night) from 1.74 to 2.01 people/1000 ft2. Uncertainty in population distribution is given in PDF charts depicting the ensemble of probable occupancy estimates for the thickness for the Philippine hospitals. The thicker or hazier the curve, the more uncertainty about the occupancy estimates for that percentile. If another percentile is selected, the information in the right pane updates with the new percentile uncertainty and the uncertainty ensembles are redrawn in the graphs.

In the left pane of Fig. 4, users can further explore reported occupancies by selecting from the categories “subnational” or “facility type.” At the subnational level, the occupancies are shown by facility type, illustrating differences throughout the country. When selecting facility type, there is an option to explore the occupancies for the selected type for all countries. The occupancies for a country or region can be downloaded by percentile with the accompanying uncertainty.

3.1 Model variability

When processing large open source datasets, random sampling has been implemented to determine the requisite number of observation necessary to capture the variability in building occupancy for that building function and geographic area.

For example, Fig. 5 displays the daytime probability distributions for the 10th, 50th, and 90th percentiles for Tokyo schools and the number of observations published into the PDT system, beginning with the national level baseline model for Japan schools shown in the table in Fig. 5 for the 10th, 50th and 90th percentile ranges and the subsequent refinement of the baseline model as the number of models increases. The 10th percentile converges to zero, 50th percentile is close to 1, and the 50th around 15 people/1000 ft2. Most of the change occurs within the first 10 observation publications, with a smaller progression after 10 observations.

Fig. 5
figure 5

Model learning begins with the baseline model and number of models that refine the baseline estimate and the resulting range of occupancies for Low, Mid, and High (10th, 50th, and 90th percentile, respectively) and the change in the median range shown in the table and the bottom left chart. In the bottom right corner, visualization of probability distributions and ensembles for Tokyo schools for the 50th percentile beginning with the baseline model in red and the refinement of the probability distributions to the green ensemble (#68)

The baseline model depicts a conservative baseline occupancy for schools in Tokyo representing the uncertainty with that model. The upper bound occupancy on the x axis is 200 people/1000 ft2, and based on learning from one school, the occupancy is refined quite substantially to an upper bound of 20 people/1000 ft2. There is more uncertainty about the 50th percentile from learning with the one observation than at the baseline estimate. This is expected since the baseline model and the occupancy for one school are different enough to increase (or flatten the alpha beta plane) the number of possible occupancies, resulting in more uncertainty about the variability in occupancy for schools in Tokyo. As nine more observation are published into the system, the uncertainty about the 50th percentile is reduced. After 40 schools are published into the system, the uncertainty ensemble appears to remain nearly the same as with 10 schools, but the ensemble continues to tighten, resulting in fewer probabilities concerning the school occupancy. After 68 data points are added, the overall trend shows a tightening of the distributions toward one best possible answer.

The changes are quantitatively observed through the Change in Median Range graph in Fig. 5. When a new observation is published into the system, the resulting probability distribution changes shape. However, if there is enough uncertainty in the data input in the published model, the reported median occupancy may change even if the median range does not. This is the result of having more than one best median probability estimate as the sampling method is set to randomly select from all possible median probability distributions at each update.

4 Spatial application

To further explore use, validate, and interpret the results of the PDT building occupancies and accompanying uncertainty, we spatially apply PDT values for a small research area that reports population and economic data. The research area is Shoto (Fig. 6), a small primarily residential district in the Shibuya ward of Tokyo prefecture. Adjoining districts of Shibuya are home to the two busiest railway stations in the world and one of Tokyo’s primary shopping corridors. Given its proximity to some of the most bustling areas of the city, property values in Shoto are high, and the neighborhood has an upscale character.

Fig. 6
figure 6

The Shoto (1 Chome) study area shown within Japan (far left) and Tokyo prefecture (middle) and Shoto (right) with building footprints mapped to increase visibility

Each city, town, or village maintains a Basic Resident Register in which residents are required to register and update if they move or the household occupancy changes. As of January 1, 2020, the total residential population for Shoto is 1414 (Tokyo Metropolitan Government 2020), and the number of households is 688, which indicates a household size of 2.1. For Shoto 1 Chome, the 2015 census (2015 census reference) reports 1316 residents and 678 households, while the 2014 economic census reports 145 businesses and 1460 employees.

To map the built area of Shoto, we first downloaded building footprints from the collaborative mapping website OpenStreetMap (www.openstreetmap.org) and opened them in the geospatial processing program ArcMap. These data were then displayed over the most recent available Digital Globe Worldview-3 (WV3) satellite imagery for the neighborhood, which was dated April 4, 2020. The downloaded footprints were adjusted as necessary to encompass newly built structures as determined by observation of the WV3 imagery. Once each building in the neighborhood was mapped, we used the area calculation tool within ArcMap to measure the floor area of the structures.

Next, we used ground imagery, primarily obtained from the Google Street View service, to ascertain building functions and numbers of floors. Google Street View has extensive coverage in the neighborhood, allowing nearly every building to be observed from street level. Some buildings, however, were located too far from a street to be viewed, while others were obscured behind large walls or trees. In those cases, floor numbers were determined by close examination of satellite imagery of the properties.

POI and photos provided by OpenStreetMap and Google were also used to assist in determining the building functions served by each structure in the district (Fig. 7).

Fig. 7
figure 7

Building types for Shoto determined through multiple open sources. Mixed use buildings areas were determined separately for application of the PDT building occupancies

The national level PDT building occupancies were downloaded (https://pdt.ornl.gov; May 20, 2020) for application in the spatial mapping process for Shoto (Table 1) and the uncertainty included for the 50th percentile.

Table 1 PDT building occupancies for the 10th, 50th and 90th percentile and night and day

To validate the PDT reported building occupancies, we chose to use residential subnational estimates for Shoto. The PDT building occupancies for night and day for the 10th, 50th and 90th percentiles were applied to the area of the single-use buildings. For mixed-use buildings, PDT building occupancies were applied separately to those areas and subsequently added together to report a total population for each building and percentile. Table 2 lists areas for residential versus non-residential structures as well as the resulting building occupancies (people/m2).

Table 2 Area (m2) of residential and non-residential for Shoto and the resulting residential, non-residential, and the total building occupancy for day and night

The resulting building occupancies for the Shoto area are mapped and shown in Fig. 8 for the 10th, 50th, and 90th percentile for night and day. First, the building occupancy range varies from 0 to over 232.0 people/m2, as shown in the legend. The theater reports the highest occupancy for the 90th percentile at 1443.0 people/m2. This range changes for each percentile and for night and day, and expectedly, an increase in building occupancy reported from the 10th to the 90th percentile for both night and day is represented by the blue at the low end of the occupancy to the red at the high end of the occupancy. The Total Building Occupancy (TBO) also reports the increase from 10 to 90th for both night and day including the Residential (Res) total for Shoto.

Fig. 8
figure 8

PDT building occupancies (ppl/1000 sqft) probability estimates for the 10th, 50th (the low and high about the 50th; the uncertainty range) and 90th percentile, and night and day for the Shoto area

Non-residential results are found in Table 2 along with Residential and TBO. The day TBO is higher than the night for each percentile, and this is expected given the number of employees alone is higher than the residential count in the area (see Table 1).

Further, the Residential Population for this area is reported as 1414 and found to be between the 10th and 50th percentile (Table 1), within the overall lower range of PDT residential building occupancies reported. One of the challenges in applying PDT building occupancy to the area (footprint and height) of apartments or hostels is that they may include areas that do not contribute to the dwelling. For example, an apartment building contains hallways, staircases, and storage, which are not accounted for in the census-reported dwelling size. This difference in area can cause an increase in the occupancy estimate if we are applying the PDT Multi Family estimate to the area of the apartment building (footprint and height). If the area of apartments within a specific building is known, it is best to apply building occupancies based on that information versus the size of the total building. However, that information is not readily available for all buildings.

Occupancies in Non-Residential areas of Shoto are more of a challenge to validate, even with the available economic information. The reported 1460 employees are found within the resulting night and day range of occupancies, and given the type of businesses in the area, such as hotels, restaurants, and theaters, there is an expectation that some percentage of employees will work overnight. If the employees were reported for different times of day—night or day—a comparison of the reported TBO could be performed. Unfortunately, the number of employees is reported for Shoto without distinction.

The uncertainty range of a probability distribution or percentile is available for exploration both to map out options or better understand reported occupancies. The PDT uncertainty indicates either data availability is scarce, there is uncertainty about the data at the input level, or both. In Table 1, the night and day uncertainty about the 50th percentile is found in the columns labeled 50th Mid Low and 50th Mid High, while the 50th Mid is the best midpoint reported for that range. For Residential, there is no uncertainty about the data reported as the census or housing surveys statistically capture the range of housing sizes for a geographic area, which is the best possible answer available. However, the Non-Residential categories do report an uncertainty range. For example, an urban museum for daytime reports 1.95 people/m2 for the midpoint and ranges from a low of 1.59 people/m2 and a high of 2.3 people/m2.

5 Conclusions

This paper sets out to inform the reader about scaling PDT to report building occupancies at the national and subnational level throughout the world through expansion of observation model classes. The resulting occupancy estimates include transparent reporting pertaining to model development and uncertainty. Refinements to the school baseline estimates in Tokyo were explored, paying particular attention to the roles of variability and uncertainty in formulating the estimates. Finally, spatial mapping of the building occupancies was employed for validation and interpretation of the results.

The PDT learning system reports night, day, and episodic probability occupancy estimates for every country in the world for over 50 building types. There are several classes of observation models within the PDT system to accommodate the range of disparate data or proxies, accounting for the lack of data, for countries around the world. The probability estimates report uncertainty, which is captured at the data input level and propagated throughout the learning process. Complete transparency accompanies the resulting estimates and includes formulas used within the models, the open source data inputs, and any other necessary information about the model.

Investigating the observations for 68 Tokyo schools reveals how the Bayesian process refines the established baseline occupancy estimates. The greatest refinement of the baseline estimate occurs within the first 10 models, with smaller, but continued, refinement thereafter. More investigation is needed into how variability of occupancies for observations affects the learning of PDT baseline estimates and learning differences over the various building functions for the “best possible answer”. If the variability of the models is quantified, that could affect the number of observations required to adequately inform the baseline occupancies.

A validation of spatial application of PDT occupancy estimates was performed in the Shoto 1 Chome area in Tokyo, Japan. The results revealed that the reported Shoto residential counts were within the range of PDT residential occupancy estimates. Validating the other building types is more of a challenge because exact counts of people visiting an area is not readily available. However, the PDT results for non-residential building types does capture the reported number of employees within the 10th to 90th population range. Those are promising results in developing population estimates from the PDT occupancy estimates for capturing transient and residential populations within an area. Further research into the variability of building occupancy—over a 24 h period—should further inform our knowledge of building use and patterns for validation. A general reference of use and comparison against mobility data is a possible step towards validation of building occupancies. Small area studies of mobility data in comparison with known building occupants has shown promise for validation of ambient occupancies.

While building footprints with height or 3D buildings are not yet available worldwide, they continue to become more available. Application of PDT building occupancies is a viable option to determine populations other than through census or surveys. The PDT building occupancies account for transient populations through facility types such as airports, bus and train stations, seaports, hotels, and urban and rural museums.

Additionally, introducing uncertainty for the date of open source models would weight the model impact on the probable occupancy estimates and, over time, phase out the oldest models unless updates to the model were made. In addition, there is opportunity to conduct sensitivity analyses into the reported uncertainty to identify the sources of error to direct data input updates. Finally, data discovery and subsequent model updates to accommodate new sources of data will continue, resulting in an expansion of the model classes.