Large dataset and autocorrelation computation in R

I have geographical data at the town level for 35 000 towns.
I want to estimate the impact of my covariates X on a dependent variable Y, taking into account autocorrelation.
I first computed a weight matrix and then used the spautolm command from the spdep package, but it returned an error because my dataset is too large.
Do you have any ideas about how I can fix this? Is there an equivalent command that would work?
library(haven)
library(tibble)
library(sp)
library(spdep) # needed for knearneigh, knn2nb and nb2listw
library(data.table)
myvars <- c("longitude", "latitude", "Y", "X")
newdata2 <- na.omit(X2000[myvars]) #drop observations with missing values
df <- data.frame(newdata2)
newdata3 <- unique(df) #drop duplicates in terms of longitude and latitude
coordinates(newdata3) <- c("longitude", "latitude") #set the coordinates
coords <- coordinates(newdata3)
Sy4_nb <- knn2nb(knearneigh(coords, k = 4)) # find the 4 nearest neighbours
Sy4_lw_idwB <- nb2listw(Sy4_nb, glist = idw, style = "B") #generate a list weighted by inverse distance
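Note that idw is not defined in the snippet above; presumably it holds inverse-distance weights built from the neighbour distances, for example along these lines (an assumption, not part of the original code):
dlist <- nbdists(Sy4_nb, coords)        # distances along each neighbour link
idw <- lapply(dlist, function(x) 1/x)   # inverse-distance general weights for glist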
When I try to run the following:
spautolm(formula = Y~X, data = newdata3, listw = Sy4_lw_idwB)
It returns: Error: cannot allocate vector of size 8.3 Gb
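One direction worth trying, sketched below under the assumption that the sparse-matrix methods listed in ?spautolm (now in the spatialreg package) are available in your version: the default method = "eigen" works with dense matrices, while the sparse methods avoid allocating the full n x n weights matrix.
library(spatialreg) # spautolm has moved here in recent spdep versions
fit <- spautolm(Y ~ X, data = newdata3, listw = Sy4_lw_idwB,
                family = "SAR", method = "Matrix") # sparse Cholesky instead of dense eigen
summary(fit)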

Related

vegan::betadisper() extract distance and error associated with centroid

I am trying to construct a meta-regression to look at the distance between centroids across multiple independent monitoring datasets. To build that model, for each dataset I need to extract the distance to each centroid (each dataset has the same grouping variable with two levels -- before, after), the number of points that went into calculating the centroid (n), and the standard deviation associated with the distances to each centroid (sd). I'm using vegan::betadisper() to calculate the distance to each centroid, but I am not sure whether it is possible to extract a single standard deviation associated with each centroid.
I've modified the dune dataset below as sample code. The 'Use' grouping variable has two levels: before, after.
rm(list=ls())
library(vegan)
library(dplyr)
# Species and environmental data
dune2.spe <- read.delim('https://raw.githubusercontent.com/zdealveindy/anadat-r/master/data/dune2.spe.txt', row.names = 1)
dune2.env <- read.delim('https://raw.githubusercontent.com/zdealveindy/anadat-r/master/data/dune2.env.txt', row.names = 1)
data(dune) # matrix with species data (20 samples in rows and 30 species in columns)
data(dune.env) # matrix of environmental variables (20 samples in rows and 5 environmental variables in columns)
#select two grouping levels for 'use'
dune_data <- cbind(dune2.spe, dune2.env) %>%
  filter(Use == 'Pasture' | Use == 'Hayfield')
dune_data$Use <- recode_factor(dune_data$Use, 'Pasture' = 'Before')
dune_data$Use <- recode_factor(dune_data$Use, 'Hayfield' = 'After')
dune_sp <- dune_data %>%
  dplyr::select(1:28)
dune_en <- dune_data %>%
  dplyr::select(29:33)
#transform relative species counts
dune_rel <- decostand(dune_sp, method = "hellinger")
dune_distmat <- vegdist(dune_rel, method = "bray", na.rm=T)
(dune_disper <- betadisper(dune_distmat, type="centroid", group=dune_en$Use))
plot(dune_disper, label=FALSE)
I am trying to arrive at the following output:
Group | before_distance | n_before | sd_before | after_distance | n_after | sd_after
Dune  | 0.4009          | 5        | ?         | 0.4314         | 7       | ?
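For reference, one way these numbers might be pulled out (a sketch, not tested against the data above): betadisper() returns the per-sample distances to the group centroids in its $distances component and the grouping factor in $group, so n and sd per group can be summarised directly.
dists <- data.frame(Use = dune_disper$group, dist = dune_disper$distances)
dists %>%
  group_by(Use) %>%
  summarise(n = n(), mean_dist = mean(dist), sd_dist = sd(dist))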

How to replicate a model 10 times, extract several objects (test results) from each run, and then take the mean?

Excuse my long question, but I really hope somebody can help me improve my code. Basically, this is what I would like to do: rerun the same model (for example, random forests) 10 times with different inputs. From each iteration I would like to extract several parameters from the model, and after all iterations take their mean and standard deviation (for example, mean AUC, mean bias). I could upload the input files, but my problem concerns a step that doesn't directly rely on them, and I presume it can be solved with some coding. Here is an example:
I'm working with species distribution models using data from a vignette accompanying "dismo" package. All the code may be found here: https://rspatial.org/raster/sdm/6_sdm_methods.html#random-forest
First I'm creating a dataset of species occurrences (pb=1) and pseudo-absences (pb=0). These are accompanied by longitude and latitude coordinates in two columns; later, environmental variables are joined to each point. Everything works fine here, so I'm able to create a model. But I would like to build several models and average their results.
These are my initial steps:
require(raster)
#that is my file with occurrence points:
points_herb <- read.csv("herbarium.csv",header=TRUE)
points_herb <- points_herb[,2:3]
points_herb <- SpatialPointsDataFrame(coords = points_herb, data = points_herb, proj4string = CRS("+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0"))
> head(points_herb)
lon_x lat_y
1 19.62083 49.62917
2 19.64583 49.62917
3 20.23750 49.61250...
#Variables (I use variables from PCA ran on climate)
files <- list.files("D:/variables/",pattern='asc',full.names=TRUE)
predictors <- raster::stack(files)
> predictors
class : RasterStack
dimensions : 1026, 1401, 1437426, 2 (nrow, ncol, ncell, nlayers)
resolution : 0.008333333, 0.008333333 (x, y)
extent : 16.36667, 28.04167, 42.7, 51.25 (xmin, xmax, ymin, ymax)
crs : +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
names : PCA1, PCA2
#Assigning variables to points
presvals <- extract(predictors, points_herb)
#reading background points (about 20000):
points_back <- read.csv("back.csv",header=TRUE,dec = ".",sep = ",")
points_back <- points_back[,2:3]
points_back <- SpatialPointsDataFrame(coords = points_back, data = points_back, proj4string = CRS("+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0"))
#assigning variables to background/pseudo-absence points
absvals <- extract(predictors, points_back)
absvals <- unique(absvals)
#**this is important!** Sampling 1000 random points from my entire dataset containing ca. 20000
absvals_1 <- absvals[sample(nrow(absvals), 1000), ]
#making an input file for the modeling
pb <- c(rep(1, nrow(presvals)), rep(0, nrow(absvals_1)))
sdmdata1 <- data.frame(cbind(pb, rbind(presvals, absvals_1)))
sdmdata1 <- na.omit(sdmdata1)
> head(sdmdata1)
pb PCA1 PCA2
1 1 9.985359 2.419048
2 1 8.711462 2.229476
...
I run the model:
#Random Forest
library(dismo)
library(randomForest)
#rf1- first random forest model
model_rf1 <- pb ~ PCA1 + PCA2
bc <- randomForest(model_rf1, data=sdmdata1)
#the model is predicted over a geographic space
bc_mod <- predict(predictors, bc, progress='')
#let's test it using CalibratR
require(CalibratR)
#extracting model probabilities to presence and absence points (those are actually from a separate dataset)
points_pres1 <- extract(bc_mod, points_pres1, cellnumbers=TRUE)
points_abs1 <- extract(bc_mod, points_abs1, cellnumbers=TRUE)
#prepare those data to test the model
testECE <- c(rep(1, nrow(points_pres1)), rep(0, nrow(points_abs1)))
testECE <- data.frame(cbind(testECE, rbind(points_pres1, points_abs1)))
testECE <- na.omit(testECE)
testECE <- subset(testECE, select = c(testECE, layer))
#make Expected Calibration Error
ECE <- getECE(testECE$testECE, testECE$layer, n_bins = 10)
#make Maximum Calibration Error
MCE <- getMCE(testECE$testECE, testECE$layer, n_bins = 10)
#some other test
require(Metrics)
#get RMSE values
RMSE <- rmse(testECE$testECE, testECE$layer)
random_forest_1 <- data.frame(mget(c('ECE', 'RMSE', 'MCE')))
rownames(random_forest_1) <- "random_forest1"
Then I would like to run the same model but using different background points. So in that case I make another input file, with another 1000 random points from the entire dataset:
absvals_2 <- absvals[sample(nrow(absvals), 1000), ]
pb <- c(rep(1, nrow(presvals)), rep(0, nrow(absvals_2)))
sdmdata2 <- data.frame(cbind(pb, rbind(presvals, absvals_2)))
sdmdata2 <- na.omit(sdmdata2)
model_rf2 <- pb ~ PCA1 + PCA2
bc <- randomForest(model_rf2, data=sdmdata2)
bc_mod <- predict(predictors, bc, progress='')
#again, let's test it using CalibratR
points_pres2 <- extract(bc_mod, points_pres2, cellnumbers=TRUE)
points_abs2 <- extract(bc_mod, points_abs2, cellnumbers=TRUE)
# everything just as above, the objects are overwritten
testECE <- c(rep(1, nrow(points_pres2)), rep(0, nrow(points_abs2)))
testECE <- data.frame(cbind(testECE, rbind(points_pres2, points_abs2)))
testECE <- na.omit(testECE)
testECE <- subset(testECE, select = c(testECE, layer))
ECE <- getECE(testECE$testECE, testECE$layer, n_bins = 10)
MCE <- getMCE(testECE$testECE, testECE$layer, n_bins = 10)
RMSE <- rmse(testECE$testECE, testECE$layer)
random_forest_2 <- data.frame(mget(c('ECE', 'RMSE', 'MCE')))
rownames(random_forest_2) <- "random_forest2"
#And finally let's make a mean from ECE, MCE, and RMSE
rf_results <- rbind(random_forest_1, random_forest_2)
rf_results_mean <- apply(rf_results, 2, mean)
#and standard deviation
rf_results_sd <- apply(rf_results, 2, sd)
result <- rbind(rf_results_mean, rf_results_sd)
In this example I made just 2 repetitions, but ideally I would like to do 10 or 100. How can I make this more elegant and automatic, rather than manually creating 100 objects?
Here's at least part of a solution for you using purrr and dplyr and iterating over lists. This would give the advantage of storing your samples and results in one dataframe.
In my example below I've used a randomly generated dataframe and a very simple function to demonstrate. I'll point out at the end how you might fit this to your own data and processes. I haven't tried this on your code and data above, as it's quite long and complex and would take a while to get my head around your methods. But hopefully you'll be able to see how to work this into your own process.
library(dplyr)
library(purrr)
# step 1: create a function that takes a dataframe and returns a dataframe
calculate_mean_sd <- function(df){
  tibble(
    mean_lat = mean(df$lat),
    sd_lat = sd(df$lat),
    mean_long = mean(df$long),
    sd_long = sd(df$long)
  )
}
# random dataframe with all values you'd want to use (i.e. your `absvals` above)
full_df <- tibble(
  id = 1:100000,
  lat = runif(100000, 0, 100),
  long = runif(100000, 0, 100)
)
# step 2: create an empty list with the number (100) of loops you want to do
df <- as.list(1:100) %>%
  map(~ tibble(iteration = .x)) # makes the iteration number into a dataframe to use later
# step 3: for each of 100 rows take a sample of a specified size and add to list as a dataframe
samples <- df %>%
  map(~ mutate(.x, sample = list(full_df[sample(nrow(full_df), 100), ])))
# step 4: iterate over list and pass your dataframes to the function, add results to new column
results <- samples %>%
  map_df(~ mutate(.x, results = list(bind_cols(.x[1], calculate_mean_sd(.x$sample[[1]])))))
# final, optional step: output a dataframe with iteration labelled and results
results$results %>%
  reduce(bind_rows)
In your data above, you'd want to use absvals[sample(nrow(absvals), 1000), ] to sample your data in step 3, and then put the parts after this into a function which returns a dataframe with the columns you need.
This is perhaps short of a full answer, but hopefully it offers some helpful steps for iterating, with purrr as a useful tool.
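To make that concrete with the question's own objects, here is a rough, untested sketch of how one run could be wrapped into a function and repeated. It assumes absvals, presvals and predictors exist as in the question, and (purely for illustration) a single fixed set of test points points_pres and points_abs in place of the per-run points_pres1/points_pres2:
library(purrr)
library(dplyr)
run_one_iteration <- function(i) {
  # resample 1000 background points and rebuild the input data
  absvals_i <- absvals[sample(nrow(absvals), 1000), ]
  pb <- c(rep(1, nrow(presvals)), rep(0, nrow(absvals_i)))
  sdmdata_i <- na.omit(data.frame(cbind(pb, rbind(presvals, absvals_i))))
  # fit and project the model
  fit <- randomForest(pb ~ PCA1 + PCA2, data = sdmdata_i)
  pred_map <- predict(predictors, fit, progress = '')
  # extract predictions at the (assumed fixed) test points
  pres <- extract(pred_map, points_pres, cellnumbers = TRUE)
  abs  <- extract(pred_map, points_abs,  cellnumbers = TRUE)
  test <- na.omit(data.frame(obs = c(rep(1, nrow(pres)), rep(0, nrow(abs))),
                             rbind(pres, abs)))
  tibble(iteration = i,
         ECE  = getECE(test$obs, test$layer, n_bins = 10),
         MCE  = getMCE(test$obs, test$layer, n_bins = 10),
         RMSE = rmse(test$obs, test$layer))
}
results <- map_df(1:10, run_one_iteration) # 10 repetitions; use 1:100 for 100
results %>%
  summarise(across(c(ECE, MCE, RMSE), list(mean = mean, sd = sd)))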
Edit: P.S. please do comment with any questions or parts that aren't working and I'll see if I can add any clarifications or notes above.

Fitting spatial regression with repeated measures making incorrect neighbours

I am trying to fit a spatial lag model (spdep::lagsarlm) after having built a neighbour distance matrix. I have two questions, because every example I read about fits data with only a single observation (one row) per spatial location.
My dataset has a variable number of observations for each spatial point (but it's not temporal data), and I was wondering whether it is valid to proceed like this, especially when creating the distance matrix, because I get a warning:
Warning message:
In spdep::knearneigh(., k = 3, longlat = F) :
knearneigh: identical points found
Indeed, when I plot the neighbour relationships, I get a wrong graph (I guess the algorithm treats the repeated points as their own neighbours, so they end up isolated); when I keep only the first measure per location, the plot is OK.
library(sp); library(spdep); library(magrittr) # magrittr for %>%
set.seed(12345678)
df <- data.frame('id' = rep(1:10, 3),
                 'x' = rep(rnorm(10, 48, 0.1), 3),
                 'y' = rep(rnorm(10, 2.3, 0.05), 3),
                 'response' = c(rnorm(5), rnorm(20, 1), rnorm(5)),
                 'type.sensor' = rep(c(rep("a", 6), rep("b", 4)), 3))
coordinates(df)<-c("x", "y")
w <- df %>% spdep::knearneigh(k=3, longlat=F) %>% knn2nb
plot(w, coordinates(df))
df2 = head(df, 10) # I keep only the first measure for each location
w2 <- df2 %>% spdep::knearneigh(k=3, longlat=F) %>% knn2nb
plot(w2, coordinates(df2))
So I'm not very confident in the result of my lagsarlm model in the first case.
lagsarlm(response ~ type.sensor, data=df, listw=nb2listw(w), type = "lag" )
lagsarlm(response ~ type.sensor, data=df, listw=nb2listw(w2), type = "lag" )
However, if I try to fit my model with the larger dataset but with the right neighbours matrix, it complains:
Error in lagsarlm(response ~ type.sensor, data = df, listw = nb2listw(w2), :
Input data and weights have different dimensions
How can I deal with such data, in the end? Thanks.

How to input dissimilarity matrix in spatial analysis in spdep R

Aim: I want to create a dissimilarity matrix between pairs of coordinates. I want to use this matrix as an input to calculate local spatial clusters using Moran's I (LISA) and later in geographically weighted regression (GWR).
Problem: I know I can use dnearneigh{spdep} to calculate a distance matrix. However, I want to use the travel times between polygons that I have already estimated. In practice, I think this would be like inputting a dissimilarity matrix that tells the distance/difference between polygons based on another characteristic. I've tried inputting my matrix to dnearneigh{spdep}, but I get the error Error: ncol(x) == 2 is not TRUE
dist_matrix <- dnearneigh(diss_matrix_invers, d1=0, d2=5, longlat = F, row.names=rn)
Any suggestions? There is a reproducible example below:
EDIT: Digging a bit further, I think I could use mat2listw{spdep} but I'm still not sure it keeps the correspondence between the matrix and the polygons. If I add row.names = T it returns an error row.names wrong length :(
listw_dissi <- mat2listw(diss_matrix_invers)
lmoran <- localmoran(oregon.tract@data$white, listw_dissi,
                     zero.policy = TRUE, alternative = "two.sided")
Reproducible example
library(UScensus2000tract)
library(spdep)
library(ggplot2)
library(dplyr)
library(reshape2)
library(magrittr)
library(data.table)
library(reshape)
library(rgeos)
library(geosphere)
# load data
data("oregon.tract")
# get centroids as a data.frame
centroids <- as.data.frame( gCentroid(oregon.tract, byid=TRUE) )
# Convert row names into first column
setDT(centroids, keep.rownames = TRUE)[]
# create Origin-destination pairs
od_pairs <- expand.grid.df(centroids, centroids) %>% setDT()
colnames(od_pairs) <- c("origi_id", "long_orig", "lat_orig", "dest_id", "long_dest", "lat_dest")
# calculate dissimilarity between each pair.
# For the sake of this example, let's use ellipsoid distances. In my real case I have travel-time estimates
od_pairs[, dist := distGeo(matrix(c(long_orig, lat_orig), ncol = 2),
                           matrix(c(long_dest, lat_dest), ncol = 2))]
# This is the format of how my travel-time estimates are organized, it has some missing values which include pairs of origin-destination that are too far (more than 2hours apart)
od_pairs <- od_pairs[, .(origi_id, dest_id, dist)]
od_pairs$dist[3] <- NA
> origi_id dest_id dist
> 1: oregon_0 oregon_0 0.00000
> 2: oregon_1 oregon_0 NA
> 3: oregon_2 oregon_0 39874.63673
> 4: oregon_3 oregon_0 31259.63100
> 5: oregon_4 oregon_0 33047.84249
# Convert to matrix
diss_matrix <- acast(od_pairs, origi_id~dest_id, value.var="dist") %>% as.matrix()
# get an inverse matrix of distances, make sure diagonal=0
diss_matrix_invers <- 1/diss_matrix
diag(diss_matrix_invers) <- 0
Calculate simple distance matrix
# get row names
rn <- sapply(slot(oregon.tract, "polygons"), function(x) slot(x, "ID"))
# get centroids coordinates
coords <- coordinates(oregon.tract)
# get distance matrix
dist_matrix <- dnearneigh(coords, d1 = 0, d2 = 5, longlat = TRUE, row.names = rn)
class(dist_matrix)
> [1] "nb"
Now how to use my diss_matrix_invers here?
You are right about the use of mat2listw{spdep}. By default the function preserves the row names to keep the correspondence with the matrix. You can also specify the row.names like so:
listw_dissi <- mat2listw(diss_matrix_invers, row.names = row.names(diss_matrix_invers))
The list that is created will contain the appropriate names for the neighbours along with their distance as weights. You can check this by looking at the neighbours.
listw_dissi$neighbours[[1]][1:5]
And you should be able to use this directly to calculate Moran's I.
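For example, reusing the call from the question (zero.policy = TRUE is assumed here in case some regions end up with no neighbours):
lmoran <- localmoran(oregon.tract@data$white, listw_dissi,
                     zero.policy = TRUE, alternative = "two.sided")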
dnearneigh{spdep}
There is no way to use diss_matrix within dnearneigh{spdep}, as this function takes a matrix of point coordinates.
However, if you need to define a set of neighbours given a distance threshold (d1, d2) using your own distance matrix (travel time), I think the function below can do the trick.
dis.neigh <- function(x, d1 = 0, d2 = 50){
  #x must be a symmetrical distance matrix
  style <- "M" #for style unknown
  #create empty lists
  neighbours <- list()
  weights <- list()
  #set attributes of neighbours list
  attr(neighbours, "class") <- "nb"
  attr(neighbours, "distances") <- c(d1, d2)
  attr(neighbours, "region.id") <- colnames(x)
  #check each row for neighbours that satisfy the distance threshold
  i <- 1
  for(row in c(1:nrow(x))){
    neighbour <- c() #reset per row so earlier rows don't leak into later ones
    weight <- c()
    j <- 1
    for(col in c(1:ncol(x))){
      if(x[row, col] > d1 && x[row, col] < d2){
        neighbour[j] <- col
        weight[j] <- 1/x[row, col] #inverse distance (dissimilarity)
        j <- j + 1
      }
    }
    neighbours[i] <- list(neighbour)
    weights[i] <- list(weight)
    i <- i + 1
  }
  #assemble neighbour and weight lists into a listw-like object
  res <- list(style = style, neighbours = neighbours, weights = weights)
  class(res) <- c("listw", "nb")
  attr(res, "region.id") <- attr(neighbours, "region.id")
  attr(res, "call") <- match.call()
  return(res)
}
And use it like so:
nb_list <- dis.neigh(diss_matrix, d1 = 0, d2 = 10000)
lmoran <- localmoran(oregon.tract@data$white, nb_list, alternative = "two.sided")

Tapply only producing missing values

I'm trying to generate estimates of the percent of Catholics within a given municipality in a country and I'm using multilevel regression and post-stratification of survey data.
The approach fits a multilevel logit and generates predicted probabilities of the dependent variable. It then weights the probabilities using poststratification of the sample to census data.
I can generate the initial estimates (which are essentially just the predicted probability of being Catholic for a given individual in the survey data). However, when I try to take the average with the last line of code below, it only returns NAs for each of the municipalities. The initial cell predictions have some missing values, but nowhere near a majority.
I don't understand why I can't generate municipal weighted averages as I've followed the procedure using different data. Any help would be greatly appreciated.
rm(list=ls(all=TRUE))
library("arm")
library("foreign")
#read in megapoll and attach
ES.data <- read.dta("ES4.dta", convert.underscore = TRUE)
#read in municipal-level dataset
munilevel <- read.dta("election.dta",convert.underscore = TRUE)
munilevel <- munilevel[order(munilevel$municode),]
#read in Census data
Census <- read.dta("poststratification4.dta",convert.underscore = TRUE)
Census <- Census[order(Census$municode),]
Census$municode <- match(Census$municode, munilevel$municode)
#Create index variables
#At level of megapoll
ES.data$ur.female <- (ES.data$female *2) + ES.data$ur
ES.data$age.edr <- 6 * (ES.data$age -1) + ES.data$edr
#At census level (same coding as above for all variables)
Census$cur.cfemale <- (Census$cfemale *2) + Census$cur
Census$cage.cedr <- 6 * (Census$cage -1) + Census$cedr
##Municipal level variables
Census$c.arena<- munilevel$c.arena[Census$municode]
Census$c.fmln <- munilevel$c.fmln[Census$municode]
#run individual-level opinion model
individual.model1 <- glmer(formula = catholic ~ (1|ur.female) + (1|age)
                           + (1|edr) + (1|age.edr) + (1|municode) + p.arena + p.fmln,
                           data = ES.data, family = binomial(link = "logit"))
display(individual.model1)
#examine random effects and standard errors for urban-female
ranef(individual.model1)$ur.female
se.ranef(individual.model1)$ur.female
#create vector of state ranefs and then fill in missing ones
muni.ranefs <- array(NA,c(66,1))
dimnames(muni.ranefs) <- list(c(munilevel$municode),"effect")
for(i in munilevel$municode){
muni.ranefs[i,1] <- ranef(individual.model1)$municode[i,1]
}
muni.ranefs[,1][is.na(muni.ranefs[,1])] <- 0 #set states with missing REs (b/c not in data) to zero
#create a prediction for each cell in Census data
cellpred1 <- invlogit(fixef(individual.model1)["(Intercept)"]
                      + ranef(individual.model1)$ur.female[Census$cur.cfemale, 1]
                      + ranef(individual.model1)$age[Census$cage, 1]
                      + ranef(individual.model1)$edr[Census$cedr, 1]
                      + ranef(individual.model1)$age.edr[Census$cage.cedr, 1]
                      + muni.ranefs[Census$municode, 1]
                      + (fixef(individual.model1)["p.fmln"] * Census$c.fmln) # municipal level
                      + (fixef(individual.model1)["p.arena"] * Census$c.arena)) # municipal level
#weights the prediction by the freq of cell
cellpredweighted1 <- cellpred1 * Census$cpercent.muni
#calculates the percent within each municipality (weighted average of responses)
munipred <- 100* as.vector(tapply(cellpredweighted1, Census$municode, sum))
munipred
The extensive amount of code is totally redundant without the data! I suppose you have NAs in the object cellpredweighted1, and by default sum() propagates NAs to the answer, because if one or more elements of a vector is NA then by definition the sum of those elements is also NA.
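A quick illustration of that default behaviour:
sum(c(1, 2, NA))               # NA, because one element is missing
sum(c(1, 2, NA), na.rm = TRUE) # 3, once NAs are dropped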
If the above is the case here, then simply adding na.rm = TRUE to the tapply() call should solve the problem.
tapply(cellpredweighted1, Census$municode, sum, na.rm = TRUE)
You should be asking yourself why there are NAs at this stage and whether they result from errors earlier in the process.
