Harmonizing Spatial Data For Geospatial Modeling Part I

Having done the bulk of my statistical training under the tutelage of the Institute for Health Metrics and Evaluation (IHME), I was very much indoctrinated into the idea that more data leads to better estimates, even when that data is imperfect. Much of my work deals with how to leverage “imperfect” data to produce more informed estimates of demographic phenomena. One example that comes up in geospatial health modeling is data that is not geolocated to a particular point location but rather to a larger administrative area. If you are not familiar with geospatial modeling (or model-based geostatistics), it's the process of trying to estimate the underlying stochastic process that generates observations by leveraging correlations in space. Basically, we believe that events that occur near each other are more likely to be similar than events that occur farther away from each other. In my work estimating spatial differences in child mortality, we can imagine a whole host of reasons why that may be true, but they usually come down to other upstream factors related to health also being spatially autocorrelated: distance to a health facility, policies of an administrative region, socio-economic status, etc.

One way we can represent these spatial autocorrelations is via a Gaussian Process (GP). A GP is a stochastic process that assigns a value to each point in an N-dimensional space, where the values at any finite set of points are jointly multivariate normal. Traditionally in spatial statistics, a 2D Gaussian process with a Matérn covariance function is used to represent how we expect a process to be correlated in space. What's neat about these processes is that we can combine several of them to describe how multiple dimensions relate to one another, think space and time. All of this is to say that GPs (and Gaussian Markov Random Field (GMRF) approximations to them) give us a powerful tool to model how we might expect spatio-temporal processes to operate.
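To make the idea a bit more concrete, here is a minimal sketch of a Matérn-covariance GP on a small 2D grid, along with one common way of combining it with a temporal process. Everything specific here is an assumption chosen for illustration and not taken from any particular model: the smoothness is fixed at 3/2 (which has a simple closed form), the values of `sigma`, `rho`, and `phi` are arbitrary, the temporal correlation is AR(1), and the space-time combination is a separable Kronecker product. The function names `matern32_cov` and `sample_gp` are hypothetical helpers.

```python
import numpy as np


def matern32_cov(coords, sigma=1.0, rho=0.3):
    """Matern covariance (smoothness 3/2) between all pairs of 2D points.

    sigma is the marginal standard deviation and rho the range parameter;
    both values are illustrative assumptions.
    """
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    scaled = np.sqrt(3.0) * dist / rho
    return sigma ** 2 * (1.0 + scaled) * np.exp(-scaled)


def sample_gp(cov, rng, jitter=1e-8):
    """Draw one zero-mean multivariate normal sample with covariance cov."""
    n = cov.shape[0]
    # A small jitter on the diagonal keeps the Cholesky factorization stable.
    chol = np.linalg.cholesky(cov + jitter * np.eye(n))
    return chol @ rng.standard_normal(n)


rng = np.random.default_rng(0)

# A 10 x 10 grid of locations on the unit square.
xs = np.linspace(0.0, 1.0, 10)
grid = np.array([(x, y) for x in xs for y in xs])

space_cov = matern32_cov(grid)
field = sample_gp(space_cov, rng)  # one realization of the spatial process

# Separable space-time covariance: AR(1) correlation over 5 time periods
# combined with the spatial covariance through a Kronecker product.
n_time, phi = 5, 0.8
t = np.arange(n_time)
time_cov = phi ** np.abs(t[:, None] - t[None, :])
space_time_cov = np.kron(time_cov, space_cov)
```

Nearby grid cells in `field` end up with similar values because their covariance is high, which is exactly the "near things are more alike" assumption described above. The dense Cholesky step scales poorly with the number of locations, which is one reason GMRF approximations are attractive in practice.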

For me, this is especially important in the context of evaluating changes in under-five mortality, or \({}_5q_0\), the probability of a child dying before reaching the age of five. Over the past 30 years many countries have seen a dramatic decline in their child mortality rates, and a special focus was placed on countries that reduced their child mortality rate by two thirds between 1990 and 2015, a highlighted target of the Millennium Development Goals (MDGs). What does not get captured by these country-level measures is the inequality of health outcomes within a country. We know that socio-economic status, health care options, and a number of other factors affect the heterogeneity of health outcomes within a country, and who lies at the margins of these health outcomes is often non-random. Geography often acts as a strong proxy for these factors as well as others, as individuals with similar demographic characteristics tend to be clustered together in space. Using geolocated data on women's birth histories, such as data with latitude-longitude coordinates attached, we are able to estimate how child mortality risk is correlated in space and get a better idea of how relatively small administrative units, such as counties and municipalities, differ in their health outcomes. A great example of this kind of work, done by Jon Wakefield at the University of Washington, is found here.

This type of analysis enables us to better understand the variation and inequality in health outcomes experienced geographically; however, the data requirements for this work are often quite limiting. For a data point to be considered for analysis, it often needs to come from complete birth histories, be representative of the area where the survey was taken, and of course be geolocated. Furthermore, this analysis often doesn't take into account other sources of information on child mortality, such as vital registration systems, but I will defer that conversation for a later time. These restrictions cut down the number of surveys that we can incorporate into our statistical models at an alarming rate and limit the time periods and geographies for which we can talk about geospatial differences in child mortality. To get around this, several researchers have come up with different approaches to utilize a larger corpus of data. One example is presented in Nick Golding's paper "Mapping under-5 and neonatal mortality in Africa", which lays out the methodology for much of the geospatial work that comes out of IHME, a method they call spatial resampling. Another approach comes from Utazi et al. out of the WorldPop group in Southampton and can best be described as a variable, or right-hand-side (RHS), integration approach. A couple of other groups have taken a stab at this problem as well, including Wilson and Wakefield; however, these methods have not been thoroughly compared to establish which is most appropriate.