I am trying to account for spatial autocorrelation in a model in R. Each observation is a country for which I have the average latitude and longitude. Here's some sample data:
country <- c("IQ", "MX", "IN", "PY")
long <- c(43.94511, -94.87018, 78.10349, -59.15377)
lat <- c(33.9415073, 18.2283975, 23.8462264, -23.3900255)
Pathogen <- c(10.937891, 13.326284, 12.472374, 12.541716)
Answer.values <- c(0, 0, 1, 0)
data <- data.frame(country, long, lat, Pathogen, Answer.values)
I know spatial autocorrelation is an issue (Moran's I is significant in the whole dataset). This is the model I am testing: Answer.values (a 0/1 variable) ~ Pathogen (a continuous measure of pathogen prevalence).
model <- glm(Answer.values ~ Pathogen,
             na.action = na.omit,
             data = data,
             family = "binomial")
How would I account for spatial autocorrelation with a data structure like that?
There are a lot of potential answers to this. One easy(ish) way is to use mgcv::gam() to add a spatial smoother. Most of your model would stay the same:
library(mgcv)
gam(Answer.values ~ Pathogen + s([something]),
    family = "binomial",
    data = data)
where s([something]) is some form of smooth spatial term. Three possible/reasonable choices would be:
a spherical spline (?mgcv::smooth.construct.sos.smooth.spec), which takes lat/long as input; this would be useful if (1) you have data over a significant fraction of the earth's surface (so that a smoother that constructs a 2D planar spatial smooth is less reasonable); (2) you want to account for distance between locations in a continuous way (a minimal example of this option is sketched just after this list)
a Markov random field (?mgcv::smooth.construct.mrf.smooth.spec). This is essentially the spatial analogue of a discrete order-1 autoregressive structure (i.e. countries are directly correlated only with their direct neighbours, however you choose to define that). In order to do this you have to come up somehow with a neighbourhood list (i.e. a list of countries, where the elements are lists of countries that are neighbours of the original countries). You could do this however you like, e.g. by finding nearest neighbours geographically. (Check out some introductions to spatial statistics/spatial data analysis in R.) (On the other hand, if you're testing Moran's I then you've presumably already come up with some way to identify first-order neighbours ...)
if you're comfortable treating lat/long as coordinates in a 2D plane, then you have lot of choices of smoothing basis, e.g. ?mgcv::smooth.construct.gp.smooth.spec (Gaussian process smoothers, which include most of the standard spatial autocorrelation models as special cases)
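For concreteness, a minimal sketch of the first option applied to the model above (assuming lat and long are in decimal degrees; note that the four-row sample data are far too few to actually fit this, and the basis dimension would need tuning on the real dataset):
library(mgcv)
# spline-on-the-sphere smooth of the coordinates alongside the Pathogen effect
m_sos <- gam(Answer.values ~ Pathogen + s(lat, long, bs = "sos"),
             family = binomial,
             data = data,
             method = "REML")
summary(m_sos)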
A helpful link for getting up to speed with GAMs in R ...
Related
I have been using lppm from spatstat and I want to fit a log-linear model.
I can define covariates as linfun objects and use them in the model.
Let's say we are interested in modeling the car theft problem in Australia. Let's assume cov1 is the distance to the nearest school and cov2 is the distance to the nearest police department.
We want to use X and Y coordinates in the model.
Would lppm(L ~ cov1 + cov2 + x + y) work? Do the x and y in the model refer to the locations of the events?
How can I use a thin-plate spline on the linear network? I can create grids on a ppp, but lpp is not as straightforward as I thought. Can I pass a matrix to an lppm object?
Code in spatstat for linear networks is still under development, but lppm is based upon ppm, so you can look at the help files or documentation for ppm for explanation. The variable names appearing in the model formula can be:
the names of images (of class im or linim)
the names of spatial functions (class funxy or linfun)
the symbols x, y (representing cartesian coordinates)
the symbol marks, representing categorical mark values
A term in the model formula may be just the name of one of these variables, or an expression involving these variable names, including functions applied to these variables.
Your example would work.
You can get B-splines of the cartesian coordinates by including a term such as bs(x)
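For instance, a minimal sketch using the chicago dataset that ships with spatstat (your own covariates omitted; bs() comes from the splines package, which needs to be loaded):
library(spatstat)
library(splines)
X <- unmark(chicago)                               # an example point pattern on a linear network
fit <- lppm(X ~ bs(x, df = 4) + bs(y, df = 4))     # B-spline terms in the cartesian coordinates
fit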
If you need more help, first read chapter 9 of the spatstat book
I have a panel dataset with several hundred regions, ~10 years, and spatial data for the regions. I created a spatial weights matrix with the spdep package (in the standard way, then nb2listw).
I thus have a matrix with weights for each region (in relation to the other regions), but each region is represented just once.
I would like to run some of the spatial regressions from the spdep package (lagsarlm, errorsarlm), but I get an error:
Error in subset.listw(listw, subset, zero.policy = zero.policy) :
Not yet able to subset general weights lists
and
Error in lagsarlm(y ~ x1 + x2: Input data and weights have different dimensions
I assume this is because the weights matrix has only one row per region (and so only one year can be handled). Do you have any suggestions on how to attack the issue?
My ideas revolve around the following:
Extend the spatial weight matrix (a rough sketch of this is included below), OR
Tell spdep that the regions will repeat in the same order (but how?)
Looking forward to your suggestions.
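A rough sketch of the first idea, assuming the panel is stacked with all regions repeated for each year, and using hypothetical names (lw is the existing listw object for the N regions, nyears is the number of years); the neighbour weights are expanded to the panel dimension with a Kronecker product:
library(spdep)
library(Matrix)
W <- as(lw, "CsparseMatrix")                  # N x N sparse weights matrix from the listw
W_panel <- kronecker(Diagonal(nyears), W)     # block-diagonal (N*nyears) x (N*nyears) weights
lw_panel <- mat2listw(W_panel, style = "W")   # back to a listw for lagsarlm()/errorsarlm()
The stacking order of the data must match the block structure (all regions for year 1, then all regions for year 2, and so on).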
I am analysing ambulance incident data. The dataset covers three years and has roughly 250000 incidents.
Preliminary analysis indicates that the incident distribution is related to population distribution.
Fitting a point process model using spatstat agrees with this, with broad agreement in a partial residual plot.
However, it is believed that the trend diverges from this population related trend during the "social hours", that is Friday, Saturday night, public holidays.
I want to take subsets of the data and see how they differ from the gross picture. How do I account for the difference in intensity due to the smaller number of points inherent in a subset of the data?
Or is there a way to directly use my fitted model for the gross picture?
It is difficult to provide data as there are privacy issues, and with the size of the dataset it is hard to simulate the situation. I am not by any means a statistician, hence I am floundering a bit here. I have a copy of
"Spatial Point Patterns: Methodology and Applications with R", which is very useful.
I will try to explain my methodology so far with pseudocode:
amb_pts <- ppp(the_ambulance_data x and y, the_window)       # ~250k incident points
census_pts <- ppp(census_data x and y, the_window)           # ~1.3m census points
Best bandwidth for the density surface by visual inspection seemed to be bw.scott. This was used to fit a density surface for the points.
inc_density <- density(amb_pts, bw.scott)
pop_density <- density(census_pts, bw.scott)
fit0 <- ppm(amb_pts ~ 1)                # the point process model is fitted to the point pattern itself
fit_pop <- ppm(amb_pts ~ pop_density)   # with the population density image as a spatial covariate
partials <- parres(fit_pop, "pop_density")
Plotting the partial residuals shows that the agreement with the linear fit is broadly acceptable, with some areas of 'wobble'.
What I am thinking of doing next:
the_ambulance_data %>% group_by(day_of_week, hour_of_day) %>%
select(x_coord, y_coord) %>% nest() -> nested_day_hour_pts
Taking one of these list items and creating a ppp, say fri_2300hr_ppp:
fri23.den <- density(fri_2300hr_ppp, bw.scott)
fit_fri23 <- ppm(fri_2300hr_ppp ~ pop_density)   # again fitted to the point pattern, with the same covariate
How do I then compare this ppp or density with the broader model? I can do characteristic tests such as dispersion and clustering. Can I compare the partial residuals of fit_pop and fit_fri23?
How do I control for the effect of the number of points on the density, i.e. 250k points in the full data versus maybe 8000 points in the subset? I'm thinking maybe quantiles of the density surface?
Attach marks to the ambulance data representing the subsets/categories of interest (e.g. 'busy' vs 'non-busy'). For an informal or nonparametric analysis, use tools like relrisk, or use density.splitppp after separating the different types of points with split.ppp. For a formal analysis (taking into account the sample sizes etc.), fit several candidate models to the same data, one model having a busy/non-busy effect and another having no such effect, then use anova.ppm to test formally whether there is a busy/non-busy effect. See Chapter 14 of the book mentioned.
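As a rough sketch of the formal route (hypothetical names: amb is the incident ppp carrying a factor mark with levels "busy" and "non-busy", and pop_density is the population density image used earlier):
library(spatstat)
# informal check: spatially varying probability that an incident is "busy"
plot(relrisk(amb))
# formal check: compare a model with no busy/non-busy effect against one where
# the intercept and the population effect differ between the two periods
fit_null <- ppm(amb ~ pop_density)
fit_busy <- ppm(amb ~ marks * pop_density)
anova(fit_null, fit_busy, test = "Chi")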
I have been unable to find any information specific to local block kriging with a local variogram using the gstat package in R. There is freeware called VESPER from the Australian Centre for Precision Agriculture that is able to do this, and from what I have read it should be possible in R; I could just use some help putting together a for loop to make the gstat functions work locally.
Using the meuse data set as an example, I have been able to calculate and fit a global variogram to a data set:
library(gstat)
data(meuse)
coordinates(meuse) = ~x+y
data(meuse.grid)
gridded(meuse.grid) = ~x+y
logzinc_vgm<- variogram(log(zinc)~1, meuse)
logzinc_vgm_fit <- fit.variogram(logzinc_vgm, model=vgm("Sph", "Exp"))
logzinc_vgm_fit
plot(logzinc_vgm, logzinc_vgm_fit)
This gives a nice plot of the variogram for the whole data set with the fitted model. Then I can use this to perform block kriging over the entire data set:
logzinc_blkkrig <- krige(log(zinc)~1, meuse, meuse.grid, model = logzinc_vgm_fit, block=c(100,100))
spplot(logzinc_blkkrig["var1.pred"], main = "ordinary kriging predictions")
spplot(logzinc_blkkrig["var1.var"], main = "ordinary kriging variance")
This produces a plot of the interpolated data as well as a plot of the variance for each predicted point. So this would be perfect if I wanted these functions to work once for my entire data set...
But I have been unable to generate a for-loop to handle these functions on a local level.
My goals are:
1. For each point in my grid file (which I have tried as both a data frame and a SpatialPointsDataFrame), I would like to subset the points in my data file that lie within a distance of the range given in the global variogram (easy to retrieve, i.e. logzinc_vgm_fit[2,3])
2. On this subset of data, I would like to calculate the variogram (as above) and fit a model to it (as above)
3. Based on this model, I would like to perform block kriging to get a predicted value and variance at that grid point
4. Build the above three steps into a for-loop to predict values at each grid point based on the local variogram around each grid point
Note: as with the meuse data set built into the gstat package, my grid and my observation data frames have different dimensions.
Thank you very much for chiming in if anyone is able to tackle this question. Happy to post the code I am working with so far if it would be useful.
I made a for loop that I think accomplishes what you request. I do not think that block kriging is required for this because the loop predicts at each grid cell.
The rad parameter is the search radius, which can be set to other quantities, but currently references the global variogram range (with nugget effect). I think it would be best to search a little further for points because if you only search up to the global variogram range, a local variogram fit may not converge (i.e. no observed range).
The k parameter is for the minimum number of nearest neighbors within rad. This is important because some locations may have no points within rad, which would result in an error.
You should note that the way you specified model=vgm("Sph", "Exp") seems to take the first listed method. So, I used the Spherical model in the for loop, but you can change to what you want to use. Matern may be a good choice if you think the shape will change with location.
#Specify the search radius for the local variogram
rad = logzinc_vgm_fit[2,3]
#Specify minimum number of points for prediction
k = 25
#Index to indicate if any result has been stored yet
stored = 0
for (i in 1:nrow(meuse.grid)){
  #Calculate the Euclidean distance from the current grid cell to all data points
  dists = spDistsN1(pts = meuse, pt = meuse.grid[i,], longlat = FALSE)
  #Find indices of the points within rad of this grid point
  IndsInRad = which(dists < rad)
  if (length(IndsInRad) < k){
    print('Not enough nearest neighbors')
  }else{
    #Calculate the local empirical variogram with these points
    locVario = variogram(log(zinc)~1, meuse[IndsInRad,])
    #Fit a spherical model to the local variogram
    locVarioFit = fit.variogram(locVario, model=vgm("Sph"))
    #Use kriging to predict at grid cell i. Suppress printed output.
    loc_krig <- krige(log(zinc)~1, meuse[IndsInRad,], meuse.grid[i,], model = locVarioFit, debug.level = 0)
    #Add result to the collected output
    if (stored == 0){
      FinalResult = loc_krig
      stored = 1
    }else{
      FinalResult = rbind(FinalResult, loc_krig)
    }
  }
}
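Once the loop finishes, the stitched-together result can be inspected much like the global prediction (assuming at least one grid cell had enough neighbours, and that the rbind-ed krige outputs keep their spatial class):
spplot(FinalResult["var1.pred"], main = "local ordinary kriging predictions")
spplot(FinalResult["var1.var"], main = "local ordinary kriging variance")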
I have been using the mahal classifier function (from the dismo package in R) in several of my analyses, and I have recently discovered that it seems to give apparently wrong distance results for points that are identical to points used in training the classifier. For background, my understanding of Mahalanobis-based classifiers is that they describe the similarity of an unclassified point by measuring that point's distance from the center of mass of the training set (while accounting for differences in scale and covariance, etc.). The Mahalanobis distance score varies from -inf to 1, where 1 indicates no distance between the unclassified point and the centroid defined by the training set. However, I found that for all points with predictor values identical to those of the training points, I still get a score of 1, as if the routine were working as a nearest-neighbor classifier. This is a very troubling behavior because it has the potential to artificially inflate the confidence of my overall classification.
Has anyone encountered this behavior? Any ideas on how to fix/ avoid this behavior?
I have written a small script below that showcases the odd behavior clearly:
rm(list = ls()) #remove all past worksheet variables
library(dismo)
logo <- stack(system.file("external/rlogo.grd", package="raster"))
#presence data (points that fall within the 'r' in the R logo)
pts <- matrix(c(48.243420, 48.243420, 47.985820, 52.880230, 49.531423, 46.182616,
54.168232, 69.624263, 83.792291, 85.337894, 74.261072, 83.792291, 95.126713,
84.565092, 66.275456, 41.803408, 25.832176, 3.936132, 18.876962, 17.331359,
7.048974, 13.648543, 26.093446, 28.544714, 39.104026, 44.572240, 51.171810,
56.262906, 46.269272, 38.161230, 30.618865, 21.945145, 34.390047, 59.656971,
69.839163, 73.233228, 63.239594, 45.892154, 43.252326, 28.356155), ncol=2)
# fit model
m <- mahal(logo, pts)
#using model, predict train data
training_vals=extract(logo, pts)
x <- predict(m, training_vals)
x #results show a perfect 1 prediction, which is highly unlikely
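(As a reference point for the 'distance to the centroid' interpretation described above, here is a hand-rolled sketch using base R's mahalanobis() on the same training_vals; the exact rescaling dismo applies may differ, but under this interpretation the training points should not all score exactly 1.)
ctr <- colMeans(training_vals)     # centroid of the training predictors
S <- cov(training_vals)            # covariance matrix of the training predictors
d2 <- mahalanobis(training_vals, center = ctr, cov = S)
1 - d2                             # expressed on a '1 minus squared distance' scale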
Now I try to make predictions for values that are an average of directly adjacent point pairs.
I do this because, given that:
(1) each point in each pair used to train the model has a perfect suitability, and
(2) at least some of these averaged points are likely to be as close to the Mahalanobis centroid as the original points,
(3) I would expect at least a few of the averaged points to have a perfect suitability as well.
#create adjacent points by shifting each original point up by one cell, then refit the model using both sets
adjacent_pts=pts
adjacent_pts[,2]=adjacent_pts[,2]+1
adjacent_training_vals=extract(logo, adjacent_pts)
new_pts=rbind(pts, adjacent_pts)
plot(logo[[1]]) #plot predictor raster and response point pairs
points(new_pts[,1],new_pts[,2])
#use model to predict mahalanobis score for new training data (point pairs)
m <- mahal(logo, new_pts)
new_training_vals=extract(logo, new_pts)
x <- predict(m, new_training_vals)
x
As expected from the odd behavior described, all training points have a distance score of 1. However, let's try to predict points that are an average of each pair:
mid_vals=(adjacent_training_vals+training_vals)/2
x <- predict(m, mid_vals)
x #NONE DO!
For me this is further indication that the mahal routine will give a perfect score to any data point that has values equal to any of the points used to train the model.
This next part is unnecessary, but it is just another way to prove the point:
Here I predict the same original training data with a near-insignificant 'budge' of the values for only one of the predictors, and show that the resulting scores change quite significantly.
mod_training_vals=training_vals
mod_training_vals[,1]=mod_training_vals[,1]*1.01
x <- predict(m, mod_training_vals)
x #predictions suddenly are far from perfect predictions