Statistical Physics of model selection and validation: cities as complex systems

StatPhys4Cities

Description

At least since the scientific revolution, interpretable mathematical models have been instrumental for advancing our understanding of the world. The “big data” era held the promise of facilitating the discovery of similarly interpretable mathematical models of natural and socio-economic systems that were previously not amenable to quantitative analysis. Yet, so far we have not seen such an explosion of new interpretable mathematical models. This is in part because machine learning models are de facto taking their place. However, because most machine learning algorithms are not interpretable, an uncontrolled use of such approaches can have unwanted consequences when model outcomes are directly linked to decisions.

Statistical physics approaches rely precisely on using interpretable micro-scale models to understand macro-scale behavior. As such, they are uniquely positioned to lay the foundations of alternative algorithms for interpretable model selection and validation that learn from data but differ significantly from the machine learning we know today.

A particular setting in which the need for better interpretable models is critical is that of socio-economic systems, and especially cities, where understanding the micro-motives of human behavior is necessary to explain the macro-behavior of those systems and to inform policy-making decisions. Unfortunately, despite the contributions of statistical physics to modeling urban phenomena, most of the tools in use do not go beyond the “bottom-up” theoretical metaphor. However, because cities are expected to grow at a global scale in the next decade and more urban data is becoming available, there is a pressing need for interpretable models of urban social contexts that are informed by data and can be validated within an urban setting.

StatPhys4Cities will take on these challenges in a coordinated effort to advance research on urban problems from a statistical physics perspective, combining models and methods from network theory, stochastic processes, and critical phenomena, among others, with a data-driven approach. Specifically, StatPhys4Cities has two overarching goals:

1) To develop interpretable model selection and validation tools using statistical physics principles. The tools should also inform the process of obtaining further data to answer specific research questions.

2) To gain a better understanding of mobility, welfare and inequalities within cities through the analysis, modeling and interpretation of existing data and the acquisition of new data specific to these problems.

Our developments are expected to cross disciplinary boundaries because scientists in the life and social sciences urgently need to exploit the large amounts of data they have. StatPhys4Cities will lead the scientific community working on cities in adopting powerful state-of-the-art methodologies for building models from data. Results from StatPhys4Cities will also have a deep impact on citizens concerned about specific urban issues in mobility, welfare and inequality by enabling participatory and inclusive research (including gender issues and vulnerable groups). Policy makers will also receive novel approaches to accurately model and understand the effect of their policies, and will be able to anticipate future scenarios on scientific grounds.

Highlights

Reformulating computational social science with citizen social science: the case of a community-based mental health care research

Bonhoure, I (Bonhoure, Isabelle); Cigarini, A (Cigarini, Anna); Vicens, J (Vicens, Julian); Mitats, B (Mitats, Barbara); Perello, J (Perello, Josep)

Computational social science is being scrutinised and some concerns have been expressed with regard to the lack of transparency and inclusivity in some of the research. However, how can computational social science be reformulated to adopt participatory and inclusive practices? And, furthermor...

Journal

Quantifying the importance and location of SARS-CoV-2 transmission events in large metropolitan areas

Aleta, A (Aleta, Alberto); Martin-Corral, D (Martin-Corral, David); Bakker, MA (Bakker, Michiel A.); Piontti, APY (Piontti, Ana Pastore Y.); Ajelli, M (Ajelli, Marco); Litvinova, M (Litvinova, Maria); Chinazzi, M (Chinazzi, Matteo); Dean, NE (Dean, Natalie E.); Halloran, ME (Halloran, M. Elizabeth); Longini, IM (Longini Jr, Ira M.); Pentland, A (Pentland, Alex); Vespignani, A (Vespignani, Alessandro); Moreno, Y (Moreno, Yamir); Moro, E (Moro, Esteban)

Detailed characterization of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) transmission across different settings can help design less disruptive interventions. We used real-time, privacy-enhanced mobility data in the New York City, NY and Seattle, WA metropolitan areas to build a ...

Journal

Gene regulatory network inference in long-lived C. elegans reveals modular properties that are predictive of novel ageing genes

Suriyalaksh, M, Raimondi, C, Mains, A, Segonds-Pichon, A, Mukhtar, S, Murdoch, S, Aldunate, R, Krueger, F, Guimerà, R, Andrews, S, Sales-Pardo, M, Casanueva, O.

We design a “wisdom-of-the-crowds” GRN inference pipeline, and couple it to complex network analysis, to understand the organisational principles governing gene regulation in long-lived glp-1/Notch C. elegans. The GRN has three layers (input, core, output) and is topologically equivalent to bow-t...

Journal

People

Roger Guimerà

Universitat Rovira i Virgili - ICREA Research Professor

roger.guimera@urv.cat

@sees_lab

Site

Marta Sales-Pardo

Universitat Rovira i Virgili - Associate Professor

marta.sales@urv.cat

@sees_lab

Site

Esteban Moro

Universidad Carlos III de Madrid - Associate Professor

emoro@math.uc3m.es

@estebanmoro

Site

Josep Perelló

Universitat de Barcelona - Associate Professor

josep.perello@ub.edu

@josperello

Site

Jordi Duch

Universitat Rovira i Virgili - Associate Professor

jordi.duch@urv.cat

@tanisjones

Miquel Montero

Universitat de Barcelona - Associate Professor

miquel.montero@ub.edu

Site

Jaume Masoliver

Universitat de Barcelona - Professor

jaume.masoliver@ub.edu

Site

Javier Villarroel

Universidad de Salamanca - Professor

javier@usal.es

Site

Iñaki Úcar

Universidad Carlos III - Postdoctoral Researcher

inaki.ucar@uc3m.es

@Enchufa2

Publications

Computational social science is being scrutinised and some concerns have been expressed with regard to the lack of transparency and inclusivity in some of the research. However, how can computational social science be reformulated to adopt participatory and inclusive practices? And, furthermore, which aspects should be carefully considered to make this reformulation possible? We present a practical case that addresses the challenge of collectively studying social interactions within community-based mental health care. This study is done by revisiting and revising social science methods such as social dilemmas and game theory and by incorporating the use of digital interfaces to run experiments in the field. The research can be framed within the emergent citizen social science or social citizen science, where shared practices are still lacking. We have identified five key steps of the research process that need to be considered in order to introduce participatory and inclusive practices: research framing, research design, experimental spaces, data sources, and actionable knowledge. Social dilemmas and game theory methods and protocols need to be reconsidered as an experiential activity that enables participants to self-reflect. Co-design dynamics and the building of a working group outside academia are important to initiate socially robust knowledge co-production. Research results should support evidence-based policies and collective actions put forward by civil society. The inclusion of underserved groups is discussed as a way forward to new avenues of computational social science, jointly with intricate ethical aspects. Finally, the paper also provides some reflections to explore the particularities of a further enhancement of social dimensions in citizen science.
[Visit journal]
Detailed characterization of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) transmission across different settings can help design less disruptive interventions. We used real-time, privacy-enhanced mobility data in the New York City, NY and Seattle, WA metropolitan areas to build a detailed agent-based model of SARS-CoV-2 infection to estimate the where, when, and magnitude of transmission events during the pandemic's first wave. We estimate that only 18% of individuals produce most infections (80%), with about 10% of events that can be considered superspreading events (SSEs). Although mass gatherings present an important risk for SSEs, we estimate that the bulk of transmission occurred in smaller events in settings like workplaces, grocery stores, or food venues. The places most important for transmission change during the pandemic and are different across cities, signaling the large underlying behavioral component underneath them. Our modeling complements case studies and epidemiological data and indicates that real-time tracking of transmission events could help evaluate and define targeted mitigation policies.
[Visit journal]
We design a “wisdom-of-the-crowds” GRN inference pipeline, and couple it to complex network analysis, to understand the organisational principles governing gene regulation in long-lived glp-1/Notch C. elegans. The GRN has three layers (input, core, output) and is topologically equivalent to bow-tie/hourglass structures prevalent among metabolic networks. To assess the functional importance of structural layers, we screened 80% of regulators and discovered 50 new ageing genes, 86% with human orthologues. Genes essential for longevity—including ones involved in insulin-like signalling (ILS)—are at the core, indicating that GRN’s structure is predictive of functionality. We used in vivo reporters, and a novel functional network covering 5,497 genetic interactions to make mechanistic predictions. We used genetic epistasis to test some of these predictions, uncovering a novel transcriptional regulator, sup-37 that works alongside DAF-16/FOXO. We present a framework with predictive power that can accelerate discovery in C. elegans and potentially humans.
[Visit journal]
Predicting countries’ energy consumption and pollution levels precisely from socio-economic drivers will be essential to support sustainable policy-making in an effective manner. Current predictive models, like the widely used STIRPAT equation, are based on rigid mathematical expressions that assume constant elasticities. Using a Bayesian approach to symbolic regression, here we explore a vast amount of suitable mathematical expressions to model the link between energy-related impacts and socio-economic drivers. We find closed-form analytical expressions that outperform the well-established STIRPAT equation and whose mathematical structure challenges the assumption of constant elasticities adopted in the literature. Our work unfolds new avenues to apply machine learning algorithms to derive analytical expressions from data, which could help find better models and solutions in energy-related problems.
[Visit journal]
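For context on the abstract above, the STIRPAT equation is commonly written as a multiplicative power law in population, affluence and technology. The symbols below (I, P, A, T, the coefficients a, b, c, d and the error term e) follow that common convention and are an assumption here, since the abstract does not spell them out.

```latex
I = a\,P^{b}A^{c}T^{d}\,e
\quad\Longrightarrow\quad
\ln I = \ln a + b\ln P + c\ln A + d\ln T + \ln e,
\qquad
\frac{\partial \ln I}{\partial \ln P} = b .
```

Because b, c and d are fixed numbers in this form, a 1% change in a driver always produces the same percentage change in the impact, whatever the level of the driver. This is the constant-elasticity assumption that the closed-form expressions found by symbolic regression are said to challenge.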
We design a "wisdom-of-the-crowds" GRN inference pipeline and couple it to complex network analysis to understand the organizational principles governing gene regulation in long-lived glp-1/Notch Caenorhabditis elegans. The GRN has three layers (input, core, and output) and is topologically equivalent to bowtie/hourglass structures prevalent among metabolic networks. To assess the functional importance of structural layers, we screened 80% of regulators and discovered 50 new aging genes, 86% with human orthologues. Genes essential for longevity, including ones involved in insulin-like signaling (ILS), are at the core, indicating that GRN's structure is predictive of functionality. We used in vivo reporters and a novel functional network covering 5,497 genetic interactions to make mechanistic predictions. We used genetic epistasis to test some of these predictions, uncovering a novel transcriptional regulator, sup-37, that works alongside DAF-16/FOXO. We present a framework with predictive power that can accelerate discovery in C. elegans and potentially humans.
[Visit journal]
Network inference is the process of learning the properties of complex networks from data. Besides using information about known links in the network, node attributes and other forms of network metadata can help solve network inference problems. Indeed, several approaches have been proposed to introduce metadata into probabilistic network models and to use them to make better inferences. However, we know little about the effect of such metadata in the inference process. Here, we investigate this issue. We find that, rather than affecting inference gradually, adding metadata causes a crossover in the inference process and in our ability to make accurate predictions, from a situation in which metadata do not play any role to a situation in which metadata completely dominate the inference process. When network data and metadata are partly correlated, metadata optimally contribute to the inference process at the crossover between data-dominated and metadata-dominated regimes.
[Visit journal]
Reliable and timely information on socio-economic status and divides is critical to social and economic research and policing. Novel data sources from mobile communication platforms have enabled new cost-effective approaches and models to investigate social disparity, but their lack of interpretability, accuracy or scale has limited their relevance to date. We investigate the divide in digital mobile service usage with a large dataset of 3.7 billion time-stamped and geo-referenced mobile traffic records in a major European country, and find profound geographical unevenness in mobile service usage, especially on news, e-mail, social media consumption and audio/video streaming. We relate such diversity with income, educational attainment and inequality, and reveal how low-income or low-education areas are more likely to engage in video streaming or social media and less in news consumption, information searching, e-mail or audio streaming. The digital usage gap is so large that we can accurately infer the socio-economic status of a small area or even its Gini coefficient only from aggregated data traffic. Our results make the case for an inexpensive, privacy-preserving, real-time and scalable way to understand the digital usage divide and, in turn, poverty, unemployment or economic growth in our societies through mobile phone data.
[Visit journal]
We present an already tested protocol from a large-scale air quality citizen science campaign (xAire, 725 measurements, see Ref. [1]). A broad partnership with 1,650 people from communities including 18 primary schools in Barcelona (Spain) provided the capacity to obtain unprecedented high-resolution NO2 levels. Communities followed the protocol to select measurement points and obtain NO2 levels from outdoor locations (n = 671), playgrounds (n = 31), and inside school buildings, primarily classrooms (n = 23). Data was calibrated and annualized with concentration levels from the city's automatic air quality monitoring reference stations [2]. (C) 2021 Published by Elsevier B.V.
[Visit journal]
A dataset from a large-scale air quality citizen science campaign is presented (xAire, 725 measurements, see Ref. [1]). A broad partnership with 1650 citizens from communities around 18 primary schools across Barcelona (Spain) provided the capacity to obtain unprecedented high-resolution NO2 levels, which in turn made it possible to provide an updated asthma Health Impact Assessment. Nitrogen dioxide levels obtained over a 4-week period during February and March 2018 with Palmes' diffusion samplers are provided herein. The dataset includes NO2 levels from outdoor locations (n = 671), playgrounds (n = 31), and inside school buildings, mostly classrooms (n = 23). Data was calibrated and annualized with concentration levels from automatic reference stations. It is shown that NO2 levels vary considerably, with very high levels in some cases. Strong differences might, however, also be explained by the fact that ambient air pollution decreases exponentially with distance from an emission source such as traffic, meaning that two samplers located about 100 m apart can measure a tenfold difference in concentration level. (C) 2021 The Authors. Published by Elsevier Inc.
[Visit journal]
Many studies have shown that there are regularities in the way human beings make decisions. However, our ability to obtain models that capture such regularities and can accurately predict unobserved decisions is still limited. We tackle this problem in the context of individuals who are given information relative to the evolution of market prices and asked to guess the direction of the market. We use a network inference approach with stochastic block models (SBM) to find the model and network representation that is most predictive of unobserved decisions. Our results suggest that users mostly use recent information (about the market and about their previous decisions) to guess. Furthermore, the analysis of SBM groups reveals a set of strategies used by players to process information and make decisions that is analogous to behaviors observed in other contexts. Our study provides an example of how to quantitatively explore human behavior strategies by representing decisions as networks and using rigorous inference and model-selection approaches.
[Visit journal]
Traditional understanding of urban income segregation is largely based on static coarse-grained residential patterns. However, these do not capture the income segregation experience implied by the rich social interactions that happen in places that may relate to individual choices, opportunities, and mobility behavior. Using a large-scale high-resolution mobility data set of 4.5 million mobile phone users and 1.1 million places in 11 large American cities, we show that income segregation experienced in places and by individuals can differ greatly even within close spatial proximity. To further understand these fine-grained income segregation patterns, we introduce a Schelling extension of a well-known mobility model, and show that experienced income segregation is associated with an individual's tendency to explore new places (place exploration) as well as places with visitors from different income groups (social exploration). Interestingly, while the latter is more strongly associated with demographic characteristics, the former is more strongly associated with mobility behavioral variables. Our results suggest that mobility behavior plays an important role in experienced income segregation of individuals. To measure this form of income segregation, urban researchers should take into account mobility behavior and not only residential patterns. Urban income segregation is often discussed in terms of where people live. Here, the authors show that the way people experience income segregation is also associated with their mobility patterns and the places they visit.
[Visit journal]
We consider a discrete-time random walk (x(t)) which, at random times, is reset to the starting position and performs a deterministic motion between them. We show that the quantity Pr(x(t+1) = n+1 | x(t) = n), as n → ∞, determines if the system is averse, neutral or inclined towards resetting. It also classifies the stationary distribution. Double barrier probabilities, first passage times and the distribution of the escape time from intervals are determined.
[Visit journal]
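The kind of walk described in the abstract above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' model: the deterministic motion is taken to be a unit step to the right, and the function names `simulate_reset_walk`, `occupation_frequencies` and the callable `advance_prob` (the probability of moving from n to n+1 instead of being reset) are hypothetical names introduced here.

```python
import random
from typing import Callable, Dict, List, Optional


def simulate_reset_walk(steps: int,
                        advance_prob: Callable[[int], float],
                        start: int = 0,
                        seed: Optional[int] = None) -> List[int]:
    """Simulate a walk that either advances by +1 or is reset to `start`.

    advance_prob(n) is the assumed probability of moving from n to n + 1;
    with probability 1 - advance_prob(n) the walk is reset to `start`.
    """
    rng = random.Random(seed)
    x = start
    path = [x]
    for _ in range(steps):
        if rng.random() < advance_prob(x):
            x += 1          # deterministic motion between resets (assumed +1)
        else:
            x = start       # reset event
        path.append(x)
    return path


def occupation_frequencies(path: List[int], burn_in: int = 0) -> Dict[int, float]:
    """Fraction of time spent at each site after discarding `burn_in` steps."""
    counts: Dict[int, int] = {}
    for x in path[burn_in:]:
        counts[x] = counts.get(x, 0) + 1
    total = len(path) - burn_in
    return {site: c / total for site, c in sorted(counts.items())}


if __name__ == "__main__":
    # Constant advance probability q = 0.75 (reset probability 0.25).
    walk = simulate_reset_walk(200_000, lambda n: 0.75, seed=1)
    freqs = occupation_frequencies(walk, burn_in=1_000)
    print({site: round(p, 3) for site, p in list(freqs.items())[:4]})
```

In the constant case with advance probability q, the position equals the time elapsed since the last reset, so the stationary distribution is geometric, P(n) = (1 - q) q^n. Passing a site-dependent `advance_prob` makes it possible to explore how its behavior at large n changes the occupation frequencies.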
We develop the process of discounting when underlying rates follow a jump-diffusion process, that is, when, in addition to diffusive behavior, rates suffer a series of finite discontinuities located at random Poissonian times. Jump amplitudes are also random and governed by an arbitrary density. Such a model may describe the economic evolution, especially when extreme situations occur (pandemics, global wars, etc.). When, between jumps, the dynamical evolution is governed by an Ornstein-Uhlenbeck diffusion process, we obtain exact and explicit expressions for the discount function and the long-run discount rate and show that the presence of discontinuities may drastically reduce the discount rate, a fact that has significant consequences for environmental planning. We also discuss as a specific example the case when rates are described by the continuous time random walk.
[Visit journal]
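The setting in the abstract above can be explored numerically with a simple Monte Carlo sketch. It is not the paper's exact solution and rests on several assumptions: the discount function is taken as D(t) = E[exp(-∫ r(s) ds)] over [0, t], the Ornstein-Uhlenbeck dynamics are discretized with an Euler scheme, Poissonian jumps are approximated by at most one jump per time step, and jump amplitudes are drawn from a zero-mean Gaussian. The function name `discount_function_mc` and all parameter names are hypothetical.

```python
import math
import random


def discount_function_mc(t: float, r0: float, alpha: float, m: float, k: float,
                         jump_rate: float, jump_std: float,
                         n_paths: int = 2000, n_steps: int = 200,
                         seed: int = 0) -> float:
    """Monte Carlo estimate of D(t) = E[exp(-integral of r over [0, t])].

    Between jumps the rate follows dr = -alpha * (r - m) dt + k dW (Euler
    scheme). A jump occurs in a step with probability jump_rate * dt
    (Bernoulli approximation of a Poisson process) and its amplitude is
    Gaussian with standard deviation jump_std (an illustrative choice).
    """
    rng = random.Random(seed)
    dt = t / n_steps
    sqrt_dt = math.sqrt(dt)
    total = 0.0
    for _path in range(n_paths):
        r = r0
        integral = 0.0
        for _step in range(n_steps):
            integral += r * dt                       # left Riemann sum
            r += alpha * (m - r) * dt + k * sqrt_dt * rng.gauss(0.0, 1.0)
            if rng.random() < jump_rate * dt:        # at most one jump per step
                r += rng.gauss(0.0, jump_std)
        total += math.exp(-integral)
    return total / n_paths


if __name__ == "__main__":
    t = 50.0
    no_jumps = discount_function_mc(t, 0.03, 0.5, 0.03, 0.01, 0.0, 0.0)
    with_jumps = discount_function_mc(t, 0.03, 0.5, 0.03, 0.01, 0.1, 0.02)
    # Effective discount rate over the horizon: -ln D(t) / t
    print(-math.log(no_jumps) / t, -math.log(with_jumps) / t)
```

Comparing the effective rate -ln D(t) / t with and without jumps gives a rough numerical feel for how discontinuities can affect long-run discounting. With no noise, no jumps and r0 = m, the estimate reduces to exp(-m t), which is a convenient sanity check.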
A critical question relevant to the increasing importance of crowd-sourced-based finance is how to optimize collective information processing and decision-making. Here, we investigate an often under-studied aspect of the performance of online traders: beyond focusing on just accuracy, what gives rise to the trade-off between risk and accuracy at the collective level? Answers to this question will lead to designing and deploying more effective crowd-sourced financial platforms and to minimizing issues stemming from risk such as implied volatility. To investigate this trade-off, we conducted a large online Wisdom of the Crowd study where 2037 participants predicted the prices of real financial assets (S&P 500, WTI Oil and Gold prices). Using the data collected, we modeled the belief update process of participants using models inspired by Bayesian models of cognition. We show that subsets of predictions chosen based on their belief update strategies lie on a Pareto frontier between accuracy and risk, mediated by social learning. We also observe that social learning led to superior accuracy during one of our rounds that occurred during the high market uncertainty of the Brexit vote.
[Visit journal]
The COVID-19 pandemic is causing mass disruption to our daily lives. We integrate mobility data from mobile devices and area-level data to study the walking patterns of 1.62 million anonymous users in 10 metropolitan areas in the United States. The data covers the period from mid-February 2020 (pre-lockdown) to late June 2020 (easing of lockdown restrictions). We detect when users were walking, distance walked and time of the walk, and classify each walk as recreational or utilitarian. Our results reveal dramatic declines in walking, particularly utilitarian walking, while recreational walking has recovered and even surpassed pre-pandemic levels. Our findings also demonstrate important social patterns, widening existing inequalities in walking behavior. COVID-19 response measures have a larger impact on walking behavior for those from low-income areas and high use of public transportation. Provision of equal opportunities to support walking is key to opening up our society and economy. Mobility restrictions implemented to reduce the spread of COVID-19 have significantly impacted walking behavior. In this study, the authors integrated mobility data from mobile devices and area-level data to study the walking patterns of 1.62 million anonymous users in 10 US metropolitan areas.
[Visit journal]
Random walks with invariant loop probabilities comprise a wide family of Markov processes with site-dependent, one-step transition probabilities. The whole family, which includes the simple random walk, emerges from geometric considerations related to the stereographic projection of an underlying geometry into a line. After a general introduction, we focus our attention on the elliptic case: random walks on a circle with built-in reflexing boundaries.
[Visit journal]
In the United States (US), low-income workers are being pushed away from city centers where the cost of living is high. The effects of such changes on labor mobility and housing price have been explored in the literature. However, few studies have focused on the occupations and specific skills that identify the most susceptible workers. For example, it has become increasingly challenging to fill the service sector jobs in the San Francisco (SF) Bay Area because appropriately skilled workers cannot afford the growing cost of living within commuting distance. With this example in mind, how does a neighborhood's skill composition change as a result of higher housing prices? Are there certain skill sets that are being pushed to the geographical periphery of a city despite their essentialness to the city's economy? Our study focuses on the impact of housing prices with a granular view of skills compositions to answer the following question: Has the density of cognitive skill workers been increasing in a gentrified area? We hypothesize that, over time, low-skilled workers are pushed away from downtown or areas where high-skill establishments thrive. Our preliminary results show that high-level cognitive skills are getting closer to the city center indicating adaptation to the increase of median housing prices as opposed to low-level physical skills that got further away. We examined tracts that the literature indicates as gentrified areas and found a pattern in which there is a temporal increase in median housing prices and the number of business establishments coupled with an increase in the percentage of skilled cognitive workers.
[Visit journal]
Cities are the innovation centers of the US economy, but technological disruptions can exclude workers and inhibit a middle class. Therefore, urban policy must promote the jobs and skills that increase worker pay, create employment, and foster economic resilience. In this paper, we model labor market resilience with an ecologically-inspired job network constructed from the similarity of occupations' skill requirements. This framework reveals that the economic resilience of cities is universally and uniquely determined by the connectivity within a city's job network. US cities with greater job connectivity experienced lower unemployment during the Great Recession. Further, cities that increase their job connectivity see increasing wage bills, and workers of embedded occupations enjoy higher wages than their peers elsewhere. Finally, we show how job connectivity may clarify the augmenting and deleterious impact of automation in US cities. Policies that promote labor connectivity may grow labor markets and promote economic resilience. Recent technological, social, and educational changes are profoundly impacting our work, but what makes labour markets resilient to those labour shocks? Here, the authors show that labour markets resemble ecological systems whose resilience depends critically on the network of skill similarities between different jobs.
[Visit journal]