I would like to fit a single model to several independent datasets in R using the spatstat package. Here, I have 3 independent datasets (ppp objects called NMJ1, NMJ2, and NMJ3), to which I want to fit a common model. The way to go should be to use the mppm function:
data <- listof(NMJ1,NMJ2,NMJ3)
data <- hyperframe(X=1:3, Points=data)
r <- matrix(c(120, 240, 240, 90), nrow = 2, ncol = 2)
model <- mppm(Points ~marks*abs(sqrt(x^2+y^2)), data, MultiStrauss(r))
In the model I am fitting, the intensity is a function of the distance to the center of the window and I supposed a MultiStrauss interaction scheme.
However, the mppm function is going to fit each dataset independently. When typing subfits(model), the fitted trend coefficients are the same for each dataset, but not the gamma coefficients. Similarly, in plotting the results of simulate(model), I observe significant and consistent differences between the 3 plots.
What is the best way to handle independent datasets (repetition of samples from the same model) in spatstat ?
This is a bug.
Your code is correct for this purpose. (Namely, when we want the same interaction coefficients to apply to all of the point patterns.)
There is a bug in the function subfits in the package spatstat.core.
The fitted model returned by mppm is correct, but the list of sub-models returned by subfits is partially incorrect.
The bug will be fixed shortly, in the development version spatstat.core 2.3-0.011 available from the GitHub repository
Related
I want to implement a "combine then predict" approach for a logistic regression model in R. These are the steps that I already developed, using a fictive example from pima data from faraway package. Step 4 is where my issue occurs.
#-----------activate packages and download data-------------##
library(faraway)
library(mice)
library(margins)
data(pima)
Apply a multiple imputation by chained equation method using MICE package. For the sake of the example, I previously randomly assign missing values to pima dataset using the ampute function from the same package. A number of 20 imputated datasets were generated by setting "m" argument to 20.
#-------------------assign missing values to data-----------------#
result<-ampute(pima)
result<-result$amp
#-------------------multiple imputation by chained equation--------#
#generate 20 imputated datasets
newresult<-mice(result,m=20)
Run a logistic regression on each of the 20 imputated datasets. Inspecting convergence, original and imputated data distributions is skipped for the sake of the example. "Test" variable is set as the binary dependent variable.
#run a logistic regression on each of the 20 imputated datasets
model<-with(newresult,glm(test~pregnant+glucose+diastolic+triceps+age+bmi,family = binomial(link="logit")))
Combine the regression estimations from the 20 imputation models to create a single pooled imputation model.
#pooled regressions
summary(pool(model))
Generate predictions from the pooled imputation model using prediction function from the margins package. This specific function allows to generate predicted values fixed at a specific level (for factors) or values (for continuous variables). In this example, I could chose to generate new predicted probabilites, i.e. P(Y=1), while setting pregnant variable (# of pregnancies) at 3. In other words, it would give me the distribution of the issue in the contra-factual situation where all the observations are set at 3 for this variable. Normally, I would just give my model to the x argument of the prediction function (as below), but in the case of a pooled imputation model with MICE, the object class is a mipo and not a glm object.
#-------------------marginal standardization--------#
prediction(model,at=list(pregnant=3))
This throws the following error:
Error in check_at_names(names(data), at) :
Unrecognized variable name in 'at': (1) <empty>p<empty>r<empty>e<empty>g<empty>n<empty>a<empty>n<empty>t<empty
I thought of two solutions:
a) changing the class object to make it fit prediction()'s requirements
b) extracting pooled imputation regression parameters and reconstruct it in a list that would fit prediction()'s requirements
However, I'm not sure how to achieve this and would enjoy any advice that could help me getting closer to obtaining predictions from a pooled imputation model in R.
You might be interested in knowing that the pima data set is a bit problematic (the Native Americans from whom the data was collected don't want it used for research any more ...)
In addition to #Vincent's comment about marginaleffects, I found this GitHub issue discussing mice support for the emmeans package:
library(emmeans)
emmeans(model, ~pregnant, at=list(pregnant=3))
marginaleffects works in a different way. (Warning, I haven't really looked at the results to make sure they make sense ...)
library(marginaleffects)
fit_reg <- function(dat) {
mod <- glm(test~pregnant+glucose+diastolic+
triceps+age+bmi,
data = dat, family = binomial)
out <- predictions(mod, newdata = datagrid(pregnant=3))
return(out)
}
dat_mice <- mice(pima, m = 20, printFlag = FALSE, .Random.seed = 1024)
dat_mice <- complete(dat_mice, "all")
mod_imputation <- lapply(dat_mice, fit_reg)
mod_imputation <- pool(mod_imputation)
My recent foray into spatial point patterns has brought me to examining LGCP Cox processes. In my case I actually have a series of point patterns that I want to fit a single model to. One of my previous inquiries brought me to using mppm to train such models( thanks Adrian Baddeley!). My next question relates to using this type of Cox model in the context of mppm.
Is this possible to fit an inhomogeneous LGCP Cox process (or other type of Cox process) to a replicated point pattern using mppm? I see some info on fitting Gibbs processes, but not really for Cox processes.
It seems like the answer may be "possibly" through some creative use of the "random" argument.
For the sake of example, lets say I'm fitting a using point pattern Y with a single covariate X (which is a single im). The call to kppm would be:
myModel = kppm(Y ~ X,"LGCP")
If I were fitting a simple inhomogeneous Poisson process to a replicated point pattern and associated covariate in hyperframe G, I believe the call would look like the following:
myModel = mppm(Y ~ X, data=G)
After going through Chapter 16 of the SpatStat book I think that fitting a replicated LGCP Cox model might be accomplished by using the simulated intensities from calls to rLGCP, maybe like this...
myLGCP = rLGCP(model="exp",mu=0,saveLambda=TRUE,nsim=2,win=myWindow)
myIntensity = lapply(myLGCP,function(x) attributes(x)$Lambda)
G$Z = myIntensity
myModel = mppm(Y ~ X, data=G, random=~Z|id)
The above approach "runs" without errors... but I have no idea if I'm even remotely close to actually accomplishing what I wanted to do. It's also a little unclear how to use the fitted object to then simulate a realization of the model, since simulate.kppm requires a kppm object.
Thoughts and suggestions appreciated.
mppm does not currently support Cox processes.
You could do the following
Fit the trend part of the model to your replicated point pattern data using mppm, for example m <- mppm(Y ~ X, data=G)
Extract the fitted intensities for each point pattern using predict.mppm
For each point pattern, using the corresponding intensity obtained from the model, compute the inhomogeneous K function using Kinhom (with argument ratio=TRUE)
Combine the K functions using pool
Estimate the cluster parameters of the LGCP by applying lgcp.estK to the pooled K function.
Optionally after step 4 you could convert the pooled K function to a pair correlation function using pcf.fv and then fit the cluster parameters using lgcp.estpcf.
This approach assumes that the same cluster parameters will apply to each point pattern. If your data consist of several distinct groups of patterns, and you want the model to assign different cluster parameter values to the different groups of patterns, then just apply steps 4 and 5 separately to each group.
I am conducting an analysis of where on the landscape a predator encounters potential prey. My response data is binary with an Encounter location = 1 and a Random location = 0 and my independent variables are continuous but have been rescaled.
I originally used a GLM structure
glm_global <- glm(Encounter ~ Dist_water_cs+coverMN_cs+I(coverMN_cs^2)+
Prey_bio_stand_cs+Prey_freq_stand_cs+Dist_centre_cs,
data=Data_scaled, family=binomial)
but realized that this failed to account for potential spatial-autocorrelation in the data (a spline correlogram showed high residual correlation up to ~1000m).
Correlog_glm_global <- spline.correlog (x = Data_scaled[, "Y"],
y = Data_scaled[, "X"],
z = residuals(glm_global,
type = "pearson"), xmax = 1000)
I attempted to account for this by implementing a GLMM (in lme4) with the predator group as the random effect.
glmm_global <- glmer(Encounter ~ Dist_water_cs+coverMN_cs+I(coverMN_cs^2)+
Prey_bio_stand_cs+Prey_freq_stand_cs+Dist_centre_cs+(1|Group),
data=Data_scaled, family=binomial)
When comparing AIC of the global GLMM (1144.7) to the global GLM (1149.2) I get a Delta AIC value >2 which suggests that the GLMM fits the data better. However I am still getting essentially the same correlation in the residuals, as shown on the spline correlogram for the GLMM model).
Correlog_glmm_global <- spline.correlog (x = Data_scaled[, "Y"],
y = Data_scaled[, "X"],
z = residuals(glmm_global,
type = "pearson"), xmax = 10000)
I also tried explicitly including the Lat*Long of all the locations as an independent variable but results are the same.
After reading up on options, I tried running Generalized Estimating Equations (GEEs) in “geepack” thinking this would allow me more flexibility with regards to explicitly defining the correlation structure (as in GLS models for normally distributed response data) instead of being limited to compound symmetry (which is what we get with GLMM). However I realized that my data still demanded the use of compound symmetry (or “exchangeable” in geepack) since I didn’t have temporal sequence in the data. When I ran the global model
gee_global <- geeglm(Encounter ~ Dist_water_cs+coverMN_cs+I(coverMN_cs^2)+
Prey_bio_stand_cs+Prey_freq_stand_cs+Dist_centre_cs,
id=Pride, corstr="exchangeable", data=Data_scaled, family=binomial)
(using scaled or unscaled data made no difference so this is with scaled data for consistency)
suddenly none of my covariates were significant. However, being a novice with GEE modelling I don’t know a) if this is a valid approach for this data or b) whether this has even accounted for the residual autocorrelation that has been evident throughout.
I would be most appreciative for some constructive feedback as to 1) which direction to go once I realized that the GLMM model (with predator group as a random effect) still showed spatially autocorrelated Pearson residuals (up to ~1000m), 2) if indeed GEE models make sense at this point and 3) if I have missed something in my GEE modelling. Many thanks.
Taking the spatial autocorrelation into account in your model can be done is many ways. I will restrain my response to R main packages that deal with random effects.
First, you could go with the package nlme, and specify a correlation structure in your residuals (many are available : corGaus, corLin, CorSpher ...). You should try many of them and keep the best model. In this case the spatial autocorrelation in considered as continous and could be approximated by a global function.
Second, you could go with the package mgcv, and add a bivariate spline (spatial coordinates) to your model. This way, you could capture a spatial pattern and even map it. In a strict sens, this method doesn't take into account the spatial autocorrelation, but it may solve the problem. If the space is discret in your case, you could go with a random markov field smooth. This website is very helpfull to find some examples : https://www.fromthebottomoftheheap.net
Third, you could go with the package brms. This allows you to specify very complex models with other correlation structure in your residuals (CAR and SAR). The package use a bayesian approach.
I hope this help. Good luck
I'm using the package glmnet, I need to run several LASSO analysis for the calibration of a large number of variables (%reflectance for each wavelength throughout the spectrum) against one dependent variable. I have a couple of doubts on the procedure and on the results I wish to solve. I show my provisional code below:
First I split my data in training (70% of n) and testing sets.
smp_size <- floor(0.70 * nrow(mydata))
set.seed(123)
train_ind <- sample(seq_len(nrow(mydata)), size = smp_size)
train <- mydata[train_ind, ]
test <- mydata[-train_ind, ]
Then I separate the target trait (y) and the independent variables (x) for each set as follows:
vars.train <- train[3:2153]
vars.test <- test[3:2153]
x.train <- data.matrix(vars.train)
x.test <- data.matrix(vars.test)
y.train <- train$X1
y.test <- test$X1
Afterwords, I run a cross-validated LASSO model for the training set and extract and writte the non-zero coefficients for lambdamin. This is because one of my concerns here is to note which variables (wavebands of the reflectance spectrum) are selected by the model.
install.packages("glmnet")
library(glmnet)
cv.lasso.1 <- cv.glmnet(y=y.train, x= x.train, family="gaussian", nfolds =
5, standardize=TRUE, alpha=1)
coef(cv.lasso.1,s=cv.lasso.1$lambda.min) # Using lambda min.
(cv.lasso.1)
install.packages("broom")
library(broom)
c <- tidy(coef(cv.lasso.1, s="lambda.min"))
write.csv(c, file = "results")
Finally, I use the function “predict” and apply the object “cv.lasso1” (the model obtained previously) to the variables of the testing set (x.2) in order to get the prediction of the variable and I run the correlation between the predicted and the actual values of Y for the testing set.
predict.1.2 <- predict(cv.lasso.1, newx=x.2, type = "response", s =
"lambda.min")
cor.test(x=c(predict.1.2), y=c(y.2))
This is a simplified code and had no problem so far, the point is that I would like to make a loop (of one hundred repetitions) of the whole code and get the non-zero coefficients of the cross-validated model as well as the correlation coefficient of the predicted vs actual values (for the testing set) for each repetition. I've tried but couldn't get any clear results. Can someone give me some hint?
thanks!
In general, running repeated analyses of the same type over and over on the same data can be tricky. And in your case, may not be necessary the way in which you have outlined it.
If you are trying to find the variables most predictive, you can use PCA, Principal Component Analysis to select variables with the most variation within the a variable AND between variables, but it does not consider your outcome at all, so if you have poor model design it will pick the least correlated data in your repository but it may not be predictive. So you should be very aware of all variables in the set. This would be a way of reducing the dimensionality in your data for a linear or logistic regression of some sort.
You can read about it here
yourPCA <- prcomp(yourData,
center = TRUE,
scale. = TRUE)
Scaling and centering are essential to making these models work right, by removing the distance between your various variables setting means to 0 and standard deviations to 1. Unless you know what you are doing, I would leave those as they are. And if you have skewed or kurtotic data, you might need to address this prior to PCA. Run this ONLY on your predictors...keep your target/outcome variable out of the data set.
If you have a classification problem you are looking to resolve with much data, try an LDA, Linear Discriminant Analysis which looks to reduce variables by optimizing the variance of each predictor with respect to the OUTCOME variable...it specifically considers your outcome.
require(MASS)
yourLDA =r <- lda(formula = outcome ~ .,
data = yourdata)
You can also set the prior probabilities in LDA if you know what a global probability for each class is, or you can leave it out, and R/ lda will assign the probabilities of the actual classes from a training set. You can read about that here:
LDA from MASS package
So this gets you headed in the right direction for reducing the complexity of data via feature selection in a computationally solid method. In looking to build the most robust model via repeated model building, this is known as crossvalidation. There is a cv.glm method in boot package which can help you get this taken care of in a safe way.
You can use the following as a rough guide:
require(boot)
yourCVGLM<- cv.glmnet(y = outcomeVariable, x = allPredictorVariables, family="gaussian", K=100) .
Here K=100 specifies that you are creating 100 randomly sampled models from your current data OBSERVATIONS not variables.
So the process is two fold, reduce variables using one of the two methods above, then use cross validation to build a single model from repeated trials without cumbersome loops!
Read about cv.glm here
Try starting on page 41, but look over the whole thing. The repeated sampling you are after is called booting and it is powerful and available in many different model types.
Not as much code and you might hope for, but pointing you in a decent direction.
I was wondering if I get some advice about fitting hurdle models using continuous data and covariates.
I have some continuous data that are generally well fit using a right-skewed distribution such as a Pareto, Gamma, or Weibull distribution. However, there several zeros in my data which are important to my analysis. In addition, I have some categorical (two-level) covariates and would like to model the parameters of a distribution as a function of these covariates in order to formally evaluate their importance (e.g., using AIC).
I have seen examples of hurdle models fit using continuous data but have not yet found any examples of how to incorporate covariates and a model-selection framework. Does anyone have any suggestions as to how to proceed or know of any R packages that allow this procedure? I have included some code below to reproduce the type of data I am working with. The non-zero data are generated via a generalized Pareto distribution from the package texmex. The parameters were estimated directly from my non-zero data. I have also included the code to plot the data in a histogram to see their distribution.
library("texmex")
set.seed(101)
zeros <- rep(0,8)
non_zeros <- rgpd(17, sigm=exp(-10.4856), xi=0.1030, u = 0)
all.data <- c(zeros,non_zeros)
hist(non_zeros,breaks=50,xlim=c(0,0.00015),ylim=c(0,9),main="",xlab="",
col="gray")
hist(zeros,add=TRUE,col="black",breaks=100,xlim=c(0,0.00015),ylim=c(0,9))
legend("topright",legend=c("zeros"),col="black",lwd=8)