Fitting spatial regression with repeated measures gives incorrect neighbours (R)

I am trying to fit a spatial lag model (spdep::lagsarlm) after building a neighbour distance matrix. I have two questions, because every time I read about this, the model is fitted to data with only a single observation (one row) per spatial location.
My dataset has a variable number of observations for each spatial point (but it is not temporal data), and I was wondering whether it is valid to proceed like this, especially when creating the distance matrix, because I get a warning:
Warning message:
In spdep::knearneigh(., k = 3, longlat = F) :
knearneigh: identical points found
Indeed, when I plot the neighbour relationships, I get a wrong graph (I guess the algorithm treats the repeated points as their own neighbours, so they end up isolated); when I keep only the first measure per location, the plot is fine.
library(sp)
library(spdep)
library(magrittr)  # for %>%
set.seed(12345678)

df <- data.frame(id = rep(1:10, 3),
                 x = rep(rnorm(10, 48, 0.1), 3),
                 y = rep(rnorm(10, 2.3, 0.05), 3),
                 response = c(rnorm(5), rnorm(20, 1), rnorm(5)),
                 type.sensor = rep(c(rep("a", 6), rep("b", 4)), 3))
coordinates(df) <- c("x", "y")

w <- df %>% spdep::knearneigh(k = 3, longlat = FALSE) %>% knn2nb()
plot(w, coordinates(df))

df2 <- head(df, 10)  # keep only the first measure for each location
w2 <- df2 %>% spdep::knearneigh(k = 3, longlat = FALSE) %>% knn2nb()
plot(w2, coordinates(df2))
So I'm not very confident in the result of my lagsarlm model in the first case:
lagsarlm(response ~ type.sensor, data=df, listw=nb2listw(w), type = "lag" )
lagsarlm(response ~ type.sensor, data=df, listw=nb2listw(w2), type = "lag" )
However, if I try to fit the model on the larger dataset but with the correct neighbour matrix, it complains:
Error in lagsarlm(response ~ type.sensor, data = df, listw = nb2listw(w2), :
Input data and weights have different dimensions
How can I deal with such data, in the end? Thanks.
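One way to make the dimensions match, sketched below purely as an illustration (the loc_id/nb_full/lw_full names are mine, not from the post), is to keep the neighbour structure built on the unique locations (w2 above) and expand it to the repeated rows by location id before calling lagsarlm(). Whether weights constructed this way are statistically sensible for repeated measures at identical points is a separate modelling question.
## Sketch only: expand the per-location neighbours (w2, built on df2) to the
## full data frame so that the listw has the same dimension as df.
loc_id <- df$id                                  # location id of each row (1..10, repeated)

nb_full <- lapply(seq_along(loc_id), function(i) {
  nb_locs <- w2[[loc_id[i]]]                     # neighbouring locations of row i's location
  which(loc_id %in% nb_locs)                     # all rows measured at those locations
})
class(nb_full) <- "nb"
attr(nb_full, "region.id") <- as.character(seq_along(loc_id))

lw_full <- nb2listw(nb_full, style = "W")
# lagsarlm(response ~ type.sensor, data = df, listw = lw_full, type = "lag")
Note that with this expansion, rows measured at the same location are not treated as each other's neighbours; whether they should be depends on how you want to model the repeated measures.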

Related

GAM distributed lag model with factor smooth interaction (by variable)

I'm trying to compare the climate response over the last 60 years of two subgroups of a plant (factor variable subgroups with 2 levels). The response of the two subgroups, which both grew on the same plots, is measured as the deviation from long-term growth (plant_growth). As climate data, mean temperature (tmean) and mean precipitation (prec) are available.
I formulated a distributed lag model using mgcv's gam() to test the hypothesis that the climate response differs between the plant subgroups:
library(mgcv)

climate_model <- gam(plant_growth ~ te(tmean, lag, by = subgroups) +
                       te(prec, lag, by = subgroups) +
                       te(tmean, prec, lag, by = subgroups),
                     data = plant_data)
plant_data is a list that contains tmean, prec and lag as separate numeric matrices, subgroups as a factor variable that distinguishes between subgroups A and B, a character variable giving the ID of the plant, and the measured plant_growth as a numeric vector.
The problem is, however, that factor by variables cannot be used with the matrix arguments from plant_data. The error message looks as follows:
Error in smoothCon(split$smooth.spec[[i]], data, knots, absorb.cons, scale.penalty = scale.penalty, :
factor `by' variables can not be used with matrix arguments.
I'm wondering if there is a way to include the factor variable subgroups into the distributed lag model so that a comparison between the two levels of the factor is possible.
I've already tried running two separate lag models for the two levels of subgroups. This works fine. However, I cannot really compare the predictions of the two models because the fit and the parameters of the smooths are different. Moreover, this way the climate response of the two subgroups is treated as if it were completely independent, which is not the case.
I was able to reproduce my problem with growth data from the treeclim package:
library("treeclim") #Data library
data("muc_spruce") #Plant growth
data("muc_clim") #Climate data
#Format climate to wide
clim <- pivot_wider(muc_clim, names_from = month, values_from = c(temp,prec))
#Format the growth data and add three new groth time series
growth <- muc_spruce %>%
select(-samp.depth) %>%
mutate(year = as.numeric(row.names(muc_spruce))) %>%
mutate(ID = 1) %>%
rename("plant_growth" = "mucstd")
additional_growth <- data.frame()
for (i in c(1:3)){
A <- growth %>%
mutate(plant_growth = plant_growth + runif(nrow(muc_spruce), min = 0, max = 0.5)) %>%
mutate(ID = ID + i)
additional_growth <- rbind(additional_growth, A)
}
growth <- rbind(growth, additional_growth)
#Bring growth and climate data together
plant_data <- na.omit(left_join(growth, clim))
rm(A, growth, clim, muc_clim, muc_spruce, additional_growth, i) #clean
#Add the subgroups label
plant_data$subgroups <- as.factor(c(rep("A", nrow(plant_data)/2), rep("B", nrow(plant_data)/2)))
#Format for gam input
plant_data <- list(lag = matrix(1:12,nrow(plant_data),12,byrow=TRUE),
year = plant_data$year,
ID = plant_data$ID,
plant_growth = plant_data$plant_growth,
subgroups = as.factor(plant_data$subgroups),
tmean = data.matrix(plant_data[,c(4:15)]),
prec = data.matrix(plant_data[,c(16:27)]))
From ?mgcv::linear.functional.terms:
The mechanism is usable with random effect smooths which take factor arguments, by using a trick to create a 2D array of factors. Simply create a factor vector containing the columns of the factor matrix stacked end to end (column major order). Then reset the dimensions of this vector to create the appropriate 2D array: the first dimension should be the number of response data and the second the number of columns of the required factor matrix. You can not use matrix or data.matrix to set up the required matrix of factor levels. See example below:
## set up a `factor matrix'...
n <- 100  # number of response data (example size; defined earlier in the help page)
fac <- factor(sample(letters, n * 2, replace = TRUE))
dim(fac) <- c(n, 2)
You cannot create a factor matrix directly with matrix() or data.matrix(), though; you have to create a factor vector and modify its dims afterwards.
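Applied to the data above, the trick would look roughly like the sketch below (the fac_mat name is mine; the number of columns is taken from the 12-column lag/tmean/prec matrices). Whether this then lets subgroups enter your te(..., by = ...) terms is not guaranteed; the snippet only illustrates the mechanism quoted from the help page.
## Sketch: build a "factor matrix" for subgroups matching the 12 lag columns.
n_obs <- length(plant_data$plant_growth)                                # number of response data
fac_mat <- factor(rep(as.character(plant_data$subgroups), times = 12))  # columns stacked end to end
dim(fac_mat) <- c(n_obs, 12)                                            # reset dims to n_obs x 12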

Create ROC curve manually from data frame

I have the below conceptual problem which I can't get my head around.
Below is an example for survey data where I have a time column that indicates how long someone needs to respond to a certain question.
Responses answered faster than a certain time threshold are flagged for cleaning ("cleaned"); the rest are kept ("final"). Now I'm interested in how the amount of cleaning would change based on this threshold, i.e. what would happen if I increase it, and what would happen if I decrease it.
So my idea was to create a ROC curve (or other model metrics) to get a visual cue about a potential threshold. The problem is that I don't have a machine-learning-like model that would give me class probabilities. So I was wondering if there's any way to create a ROC curve nonetheless with this type of data. I had the idea of just looping through my data at maybe 100 different thresholds, calculating false and true positive rates at each threshold and then doing a simple line plot, but I was hoping for a more elegant solution that doesn't require a loop.
Any ideas?
Example data:
The time column indicates the time needed per case.
The truth column indicates the current decision I want to compare against.
The predicted column indicates the cleaning decision if I cut at a time threshold of 2.5 s. This is what I need to change/loop through.
library(dplyr)
set.seed(3)
df <- data.frame(time  = c(2.5 + rnorm(5), 3.5 + rnorm(5)),
                 truth = rep(c("cleaned", "final"), each = 5)) %>%
  mutate(predicted = if_else(time < 2.5, "cleaned", "final"))
You can use ROCR for this, too:
library(dplyr)
library(ROCR)

set.seed(3)
df <- data.frame(time  = c(2.5 + rnorm(5), 3.5 + rnorm(5)),
                 truth = rep(c("cleaned", "final"), each = 5)) %>%
  mutate(predicted = if_else(time < 2.5, "cleaned", "final"))

pred <- prediction(df$time, df$truth)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize = TRUE)
You can also check the AUC value:
auc <- performance(pred, measure = "auc")
auc@y.values[[1]]
[1] 0.92
Cross-checking the AUC value with pROC:
library(pROC)
roc(df$truth, df$time)
Call:
roc.default(response = df$truth, predictor = df$time)
Data: df$time in 5 controls (df$truth cleaned) < 5 cases (df$truth final).
Area under the curve: 0.92
It is the same in both cases.
So my idea was to just create a ROC curve
Creating a ROC curve is as easy as
library(pROC)
set.seed(3)
data.frame(time  = c(2.5 + rnorm(5), 3.5 + rnorm(5)),
           truth = rep(c("cleaned", "final"), each = 5)) |>
  roc(truth, time) |>
  plot()
The problem is that I don't have a machine-learning-like model that would give me class probabilities.
Sorry, I do not understand what is machine-learning-like about the question.
I had the idea of just looping through my data at maybe 100 different thresholds
There is no point in looping over 100 possible thresholds if you have 10 observations. Sensible cutoffs are the nine situated between your time values. You can get those from roc():
df <- data.frame(time  = c(2.5 + rnorm(5), 3.5 + rnorm(5)),
                 truth = rep(c("cleaned", "final"), each = 5))
thresholds <- roc(df, truth, time)$thresholds
print(thresholds)
or
> print(thresholds)
[1] -Inf 1.195612 1.739608 1.968531 2.155908 2.329745 2.561073
[8] 3.093424 3.969994 4.586341 Inf
What exactly counts as "looping" needs a precise definition: do you want to exclude only explicit for and while loops, or anything else you consider a loop? Is c(1, 2, 3, 4) * 5 a loop? There will be a loop running under the hood either way.
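For completeness, here is a minimal base-R sketch of the manual approach described in the question, assuming the df defined just above and treating truth == "final" as the positive class (the threshold grid and object names are mine). Note that sapply() still loops internally, which is the point made above.
## Compute TPR/FPR at every observed time value used as a cutoff.
thr <- sort(unique(df$time))
rates <- sapply(thr, function(cut) {
  pred_final <- df$time >= cut          # classify as "final" above the cutoff
  c(tpr = mean(pred_final[df$truth == "final"]),
    fpr = mean(pred_final[df$truth == "cleaned"]))
})
plot(rates["fpr", ], rates["tpr", ], type = "s",
     xlab = "False positive rate", ylab = "True positive rate")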

vegdist function cannot handle datasets of abundance containing 0

As marine biologists, we need to figure out whether the abundance of 4 different fish species, counted three times over a year, differs from one artificial reef to another (reefs A, B, and C) and from one month to another (June, September, November). For each area, 3 different replicates are generated (1, 2, 3).
Let's consider the gathered data (including the factors for better understanding) as follows:
data <- as.data.frame(matrix(NA, 27, 4,
                             dimnames = list(1:27, c("Diplodus sargus", "Chelon labrosus",
                                                     "Oblada melanura", "Seriola dumerii"))))
#fish counts
data$`Diplodus sargus` <- as.numeric(c(0,0,0,0,0,0,0,0,0,5,0,0,3,0,0,0,0,1,0,0,0,0,0,0,4,0,0))
data$`Oblada melanura` <- as.numeric(c(0,0,0,10,0,0,0,0,0,0,0,0,10,5,0,0,0,0,1,0,2,3,0,2,0,0,0))
data$`Chelon labrosus`<- as.numeric(c(0,0,0,0,2,0,6,0,0,0,0,0,3,0,0,2,0,0,0,0,0,3,0,0,0,0,1))
data$`Seriola dumerii` <-as.numeric(c(4,0,2,0,1,1,0,0,9,0,0,0,0,0,3,0,0,7,0,0,0,8,0,0,0,1,0))
#factors
data$reef <- rep(c(rep("A", 3), rep("B",3), rep("C", 3)),3)
data$month <- rep(c(rep("June", 3), rep("September",3), rep("November", 3)),3)
data$combined <- c(rep("JuneA", 3), rep("JuneB",3), rep("JuneC", 3), rep("SepA", 3), rep("SepB",3), rep("SepC", 3),rep("NovA", 3), rep("NovB",3), rep("NOvC", 3))
data$Replicate <- rep(c(rep("1", 3), rep("2", 3), rep("3", 3)))
#square-root data
comp <- sqrt(data[, 1:4])
library(vegan)
mydist <- vegdist(comp, method = "bray")
pl.clust <- hclust(mydist, method = "complete")
Error in hclust(mydist, method = "complete") :
NA/NaN/Inf in foreign function call (arg 11)
The aim is to perform a permutational ANOVA (PERMANOVA) on the Bray-Curtis similarities of square-root-transformed data, in order to determine whether samples (assemblages of counted species) differ significantly depending on the factors (alone or combined). However, the vegdist function cannot handle a data set containing all-zero rows, since it generates dist objects containing NaN, which in turn cannot be handled by the adonis function. I thought of simply adding +1 to each count, as it is the differences between the samples that matter and not the absolute values. However, mydist <- ecodist::bcdist(squared_data, rmzero=FALSE) gives a very different result from that first solution. Is anybody familiar with this issue and how to handle it correctly?
Thank you, and looking forward to reading your answers.
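One workaround for the NaN values, under the assumption that empty samples should be treated as identical to each other, is the zero-adjusted Bray-Curtis of Clarke, Somerfield & Chapman (2006): add a dummy "species" with abundance 1 to every sample, so that pairs of all-zero rows get a defined dissimilarity instead of a 0/0 division. A minimal sketch using the objects defined above (the dummy column and the comp_adj name are mine):
# Zero-adjusted Bray-Curtis: the constant dummy column prevents vegdist()
# from dividing 0 by 0 when two samples are both empty.
comp_adj <- cbind(sqrt(data[, 1:4]), dummy = 1)
mydist_adj <- vegdist(comp_adj, method = "bray")
pl.clust <- hclust(mydist_adj, method = "complete")
plot(pl.clust)

# The same adjusted matrix can then go into the PERMANOVA, e.g.
# adonis2(comp_adj ~ reef * month, data = data, method = "bray")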

Maximum pseudo-likelihood estimator for soft-core point process

I am trying to fit a soft-core point process model to a point pattern using maximum pseudo-likelihood. I followed the instructions given in this paper by Baddeley and Turner, and here is the R code I came up with:
library(deldir)
library(tidyverse)
library(fields)

## MPLE; X, ramin, ramax, demin and demax are assumed to be defined upstream (not shown)

# irregularity parameter kappa
k <- 0.4

## Generate 50 x 50 dummy points; "RA" and "DE" are the x and y coordinates
dum.x <- seq(ramin, ramax, length = 50)
dum.y <- seq(demin, demax, length = 50)
dum <- expand.grid(dum.x, dum.y)
colnames(dum) <- c("RA", "DE")

## Combine with the data and flag which rows are data points and which are dummy points;
## X is the point pattern to be fitted
bind.x <- bind_rows(X, dum) %>%
  mutate(Ind = c(rep(1, nrow(X)), rep(0, nrow(dum))))

## Calculate quadrature weights from the Dirichlet (Voronoi) cell areas
w <- deldir(bind.x$RA, bind.x$DE)$summary$dir.area

## Response
y <- bind.x$Ind / w

## The sum of distances between all pairs of points (the sufficient statistic)
tmp <- cbind(bind.x$RA, bind.x$DE)
t1 <- rdist(tmp)^(-2 / k)
t1[t1 == Inf] <- 0
t1 <- rowSums(t1)
t <- -t1

## Fit the model using quasi-Poisson regression
fit <- glm(y ~ t, family = quasipoisson, weights = w)
However, the fitted parameter for t is negative, which is obviously not a correct value for a soft-core point process. Also, my point pattern was actually simulated from a soft-core process, so it does not make sense that the fitted parameter is negative. I tried my best to find bugs in the code but I can't seem to find any. The only potential issue I see is that my sufficient statistic is extremely large (on the order of 10^14), which I fear may cause numerical issues. But the statistic is large because my observation window spans a very small area and the average distance between a pair of points is around 0.006, so a sufficient statistic based on this will certainly be very large; my intuition tells me that it should not cause a numerical problem and make the fitted parameter negative.
Can anybody help and check if my code is correct? Thanks very much!
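One way to sanity-check the hand-rolled GLM (not part of the original code) is to fit the same soft-core model by maximum pseudolikelihood with spatstat's built-in Softcore interaction, assuming X has RA/DE coordinate columns and that ramin/ramax/demin/demax define the observation window, as in the code above:
library(spatstat)

## Sketch under the assumptions stated above.
X_ppp <- ppp(X$RA, X$DE, window = owin(c(ramin, ramax), c(demin, demax)))

## ppm() with a Softcore interaction also uses the Baddeley-Turner
## pseudolikelihood device, so its coefficient can be compared with the GLM fit.
fit_sc <- ppm(X_ppp ~ 1, Softcore(kappa = 0.4))
summary(fit_sc)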

Large dataset and autocorrelation computation

I have geographical data at the town level for 35 000 towns.
I want to estimate the impact of my covariates X on a dependent variable Y, taking into account autocorrelation.
I first computed a weight matrix and then used the spautolm command (from the spdep/spatialreg packages), but it returned an error message because my dataset is too large.
Do you have any idea how I can fix this? Is there another, equivalent command that would work?
library(haven)
library(tibble)
library(sp)
library(spdep)
library(data.table)

myvars <- c("longitude", "latitude", "Y", "X")
newdata2 <- na.omit(X2000[myvars])   # drop observations with missing values
df <- data.frame(newdata2)
newdata3 <- unique(df)               # drop duplicates in terms of longitude and latitude
coordinates(newdata3) <- c("longitude", "latitude")   # set the coordinates
coords <- coordinates(newdata3)

Sy4_nb <- knn2nb(knearneigh(coords, k = 4))   # the 4 closest neighbours
# idw is the inverse-distance weights list computed earlier (not shown)
Sy4_lw_idwB <- nb2listw(Sy4_nb, glist = idw, style = "B")   # weights list using inverse distance
When I try to run the following:
spautolm(formula = Y~X, data = newdata3, listw = Sy4_lw_idwB)
It returns: Error: cannot allocate vector of size 8.3 Gb
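One possible direction, sketched below under assumptions (I cannot test it on your data): the default method = "eigen" in spautolm builds dense matrices, which is what exhausts memory at 35 000 observations, whereas the sparse-matrix methods in the spatialreg package are usually feasible for weights of this size. The exact method name is an assumption about the current spatialreg interface.
library(spatialreg)

## Sketch only: same model, but asking spautolm for a sparse-matrix method
## instead of the default dense eigen decomposition.
fit <- spautolm(Y ~ X, data = newdata3, listw = Sy4_lw_idwB,
                method = "Matrix")
summary(fit)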
