Harmonizing Spatial Data For Geospatial Modeling Part II

To complicate the situation even more, I am going to throw one more model into the mix: a spatial mixture model, which accounts for the many possible locations within an administrative area that a sample could have been drawn from, provided the area is sufficiently discretized. I won't go into too much detail here, but I will have a formal write-up soon. In the meantime, the statistical model for areal data sampled from a continuous geographic field looks like a mixture of binomials, as follows.

\[Y^\star_i ~\dot \sim ~ \begin{cases} \text{Binomial}(p_{s_{j_1}, t_i}, N_i) \times q_{s_{j_1}}\\ \vdots \\ \text{Binomial}(p_{s_{j_J}, t_i}, N_i) \times q_{s_{j_J}} \ \end{cases} \text{for } j \in \mathcal{A}_i \\ \text{logit}(p_{s_j, t_i}) = \boldsymbol{\beta \cdot X_{s_j, t_i}} + \omega(s_j, t_i) \\ \boldsymbol{\omega}~\dot \sim~ \text{GMRF}(\boldsymbol{0}, Q^\mathcal{M} \otimes Q^\text{AR1}) \\ \kappa \sim \text{Log Normal}(0, 10) \\ \tau \sim \text{Log Normal}(0, 10) \\ \rho \sim \text{Logit Normal}(0, 10)\]

Where \(\mathcal{A}_i\) represents the administrative area that sample \(i\) was drawn from. This model is most suitable for Demographic and Health Survey (DHS) and Multiple Indicator Cluster Survey (MICS) data, where a standard stratified cluster design is implemented but the geospatial coordinates have been redacted. But now all we are left with is one more model to contend with when deciding which model is best, and only my word that this is the correct modeling framework given the set of statistical assumptions.

I could show you how this model outperforms the other models in a generic 1x1 spatial grid simulation, a favorite among spatial statisticians for showing model performance, but a huge motivating factor for me in the work that I do is showing that there are potential policy implications for what these results mean. To that end, a much more helpful simulation in my mind is one where we simulate data for a country using baseline values that we know about child mortality and its variation over space and time, then sample from this simulated country using the sampling strategy implemented in surveys like MICS and DHS, and run models on the sampled data. In this way we can get at the potential information loss from using one model over another or, alternatively, see how the information from all of the models points us to the same conclusion.

To do this, we are going to simulate 6 years' worth of spatial data on the Dominican Republic, a country with a number of health surveys that have been implemented since 2005. We sample the same amount of child mortality data that we observe in the 2013 DHS survey, which has geospatial coordinates given (370 clusters with 3,256 person-years), as well as in the 2014 MICS sample, which has geospatial coordinates masked (684 clusters with 25,000 person-years). Our simulated fields then look something like below, and the goal of our samples and models is to reconstruct this field.

[Figure: simulated child mortality fields for the Dominican Republic]
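To make the mixture-of-binomials likelihood from the model above a little more concrete, here is a minimal numerical sketch for a single cluster whose location is masked. It is not the implementation used for the results in this post. It assumes that \(q_{s_j}\) acts as a mixture weight (the probability that the cluster sits in grid cell \(s_j\), with the weights summing to one within \(\mathcal{A}_i\)), and the function names, cell probabilities, and weights are all hypothetical values chosen for illustration.

```python
import math
import numpy as np

def binom_logpmf(y, n, p):
    """Log of the Binomial(n, p) probability mass at y (requires 0 < p < 1)."""
    log_choose = math.lgamma(n + 1) - math.lgamma(y + 1) - math.lgamma(n - y + 1)
    return log_choose + y * math.log(p) + (n - y) * math.log1p(-p)

def mixture_loglik(y, n, p_cells, q_cells):
    """Log-likelihood of one masked cluster: log sum_j q_j * Binomial(y | n, p_j).

    p_cells: mortality probability in each candidate grid cell of the area.
    q_cells: assumed mixture weight of each cell (should sum to 1).
    """
    terms = np.array([math.log(q) + binom_logpmf(y, n, p)
                      for p, q in zip(p_cells, q_cells)])
    m = terms.max()  # log-sum-exp trick for numerical stability
    return float(m + np.log(np.exp(terms - m).sum()))

# Hypothetical administrative area discretized into three grid cells.
p_cells = np.array([0.02, 0.05, 0.10])  # illustrative cell-level probabilities
q_cells = np.array([0.5, 0.3, 0.2])     # illustrative weights, e.g. population share

# Two hypothetical clusters of 50 person-years each, with 1 and 10 deaths.
ll_low = mixture_loglik(y=1, n=50, p_cells=p_cells, q_cells=q_cells)
ll_high = mixture_loglik(y=10, n=50, p_cells=p_cells, q_cells=q_cells)
print(ll_low, ll_high)
```

In a full model the cell-level probabilities would come from the latent field through the logit link rather than being fixed, and the log-likelihoods of all clusters would be summed. When every candidate cell shares the same probability, the mixture collapses to an ordinary binomial likelihood, which is a handy sanity check.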

At this point this post is going on pretty long, so to summarize: we compare the newly proposed model against previously proposed models, Utazi et al. and Golding et al. (IHME resampling), as well as against two reference cases, one where we ignore the data without locations and one where we actually know the locations of the samples. The metrics that we use to evaluate the results cover error, bias, and coverage, as in traditional simulation comparisons, but we also look at the difference in estimates for policy-relevant administrative areas, in this case provinces in the DR.

Our proposed model outperforms the previously mentioned models in every metric (except, of course, the case where we know the actual locations), and the difference is most notable in province-level estimation. We see up to a 30% improvement in the estimate of child mortality compared to the Golding et al. estimates, as well as a 30% reduction in uncertainty compared to ignoring the MICS data, which are not explicitly geolocated. Though the results of these simulations favor our new model, I firmly believe that simulation work like this, which shows the risk of model misspecification for potentially policy-informing results in global health, should be at the forefront of our work.

Stay tuned for the paper, which will be submitted later next month with the full set of results. Also, thanks to Jon Wakefield and his Spatio-Temporal Analysis Group and to Simon Hay and the Local Burden of Disease team for counseling and advising me through this work.

plot of chunk unnamed-chunk-2
