I am trying to estimate a higher order model with the mirt package in R.
I have already conducted analysis using a bi-factor structure.
I have 60 items, the first 30 being one and the second 30 being a second dimensions of my model. I know that these two dimensions are strongly correlated (r = .8), which was the motivation to estimate a bi-factor model. However, I would like to compare the bi-factor structure to a model in which both subdimensions are explained by a general factor G.
I have troubles finding the correct model specification to estimate the model.
`specific <- c(rep(1,30),rep(2,30)) `
model2 <- '
F1 = 1-60 '
mod <- bfactor(all_data,specific, model2 = model2,itemtype = "Rasch")
However this only gives me a model with ageneral factors and two specific factors, essentially a bi-factor model. What I am trying to specify is that (a) the two lower tier factors both load on a general factor, while the lower tier factors are uncorrelated and (b) the same models but allowing the lower tier factors to correlate.
I am happy if someone could help me out here.
Related
I need to perform glm (poisson) estimations with fixed-effects (say merely unit FE) and several regressors (RHS variables). I have an unbalanced panel dataset where most (~90%) observations have missing values (NA) for some but not all regressors.
fixest::feglm() can handle this and returns my fitted model.
However, to do so, it (and fixest::demean too) removes observations that have at least one regressor missing, before constructing the fixed-effect means.
In my case, I am afraid this implies not using a significant share of available information in the data.
Therefore, I would like to demean my variables by hand, to be able to include as much information as possible in each fixed-effect dimension's mean, and then run feglm on the demeaned data. However, this implies getting negative dependent variable values, which is not compatible with Poisson. If I run feglm with "poisson" family and my manually demeaned data, I (coherently) get: "Negative values of the dependent variable are not allowed for the "poisson" family.". The same error is returned with data demeaned with the fixest::demean function.
Question:
How does feglm handle negative values of the demeaned dependent variable? Is there a way (like some data transformation) to reproduce fepois on a fixed-effect in the formula with fepois on demeaned data and a no fixed-effect formula?
To use the example from fixest::demean documentation (with two-way fixed-effects):
data(trade)
base = trade
base$ln_dist = log(base$dist_km)
base$ln_euros = log(base$Euros)
# We center the two variables ln_dist and ln_euros
# on the factors Origin and Destination
X_demean = demean(X = base[, c("ln_dist", "ln_euros")],
fe = base[, c("Origin", "Destination")])
base[, c("ln_dist_dm", "ln_euros_dm")] = X_demean
and I would like to reproduce
est_fe = fepois(ln_euros ~ ln_dist | Origin + Destination, base)
with
est = fepois(ln_euros_dm ~ ln_dist_dm, base)
I think there are two main problems.
Modelling strategy
In general, it is important to be able to formally describe the estimated model.
In this case it wouldn't be possible to write down the model with a single equation, where the fixed-effects are estimated using all the data and other variables only on the non-missing observations. And if the model is not clear, then... maybe it's not a good model.
On the other hand, if your model is well defined, then removing random observations should not change the expectation of the coefficients, only their variance. So again, if your model is well specified, you shouldn't worry too much.
By suggesting that observations with missing values are relevant to estimate the fixed-effects coefficients (or stated differently, that they are used to demean some variables) you are implying that these observations are not randomly distributed. And now you should worry.
Just using these observations to demean the variables wouldn't remove the bias on the estimated coefficients due to the selection to non-missingness. That's a deeper problem that cannot be removed by technical tricks but rather by a profound understanding of the data.
GLM
There is a misunderstanding with GLM. GLM is a super smart trick to estimate maximum likelihood models with OLS (there's a nice description here). It was developed and used at a time when regular optimization techniques were very expensive in terms of computational time, and it was a way to instead employ well developed and fast OLS techniques to perform equivalent estimations.
GLM is an iterative process where typical OLS estimations are performed at each step, the only changes at each iteration concern the weights associated to each observation. Therefore, since it's a regular OLS process, techniques to perform fast OLS estimations with multiple fixed-effects can be leveraged (as is in the fixest package).
So actually, you could do what you want... but only within the OLS step of the GLM algorithm. By no means you should demean the data before running GLM because, well, it makes no sense (the FWL theorem has absolutely no hold here).
I'm currently trying to use a GAM to calculate a rough estimation of expected goals model based purely on the commentary data from ESPN. However, all the data is either a categorical variable or a logical vector, so I'm not sure if there's a way to smooth, or if I should just use the factor names.
Here are my variables:
shot_where (factor): shot location (e.g. right side of the box)
assist_class (factor): type of assist (cross, through ball, pass)
follow_corner (logical): whether the shot follows a corner
shot_with (factor): right foot, left food, header
follow_set_piece (logical): whether the shot follows a set piece
I think I should just use the formula as just the variable names.
model <- bam(is_goal ~ shot_where + assist_class + follow_set_piece + shot_where + follow_corner + shot_where:shot_with, family = "binomial", method = "REML")
The shot_where and shot_with would incorporate any interactions between these two varaibles.
However, I was told I could smooth factor variables as well using the below structure.
model <- bam(is_goal ~ s(shot_where, bs = 'fs') + s(assist_class, bs = 'fs') + as.logical(follow_set_piece) +
as.logical(follow_corner) + s(shot_with, bs = 'fs'), data = model_data, family = "binomial", method = "REML")
This worked for creating a model, but I want to make sure this is a correct method of building the model. I've yet to see any information on using only factor/logical variables in a GAM model, so I thought it was worth asking.
If you only have categorical covariates then you aren't fitting a GAM, whether you fit the model with gam(), bam(), or something else.
What you are doing when you pass factor variables to s() using the fs basis like this
s(f, bs = 'fs')`
is creating a random intercept for each level of the factor f.
There's no smoothing going on here at all; the model is simply exploiting the equivalence of the Bayesian view of smoothing with random effects.
Given that none of your covariates could reasonably be considered random in the sense of a mixed effects model then the only justification for doing what you're doing might be as a computational trick.
Your first model is just a simple GLM (note the typo in the formula as shot_where is repeated twice in the formula.)
It's not clear to me why you are using bam() to fit this model; you're loosing computational efficiency that bam() provides by using method = 'REML'; it should be 'fREML' for bam() models. But as there is no smoothness selection going on in the first model you'd likely be better off using glm() to fit that model. If the issue is large sample sizes, there are several packages that can fit GLMs to large data, for example biglm and it's bigglm() function.
In the second model there is no smoothing going on but there is penalisation which is shrinking the estimates for the random intercepts toward zero. You're likely to get better performance on big data using the lme4 package or TMB and the glmmTMB package to fit what is a GLMM.
This is more of a theoretical question than about R, but let me provide a brief answer. Essentially, the most flexible model you could estimate would be one where you used the variables as factors. It also produces a model that is reasonably easily interpreted - where each coefficient gives you the expected difference in y between the reference level and the level represented by the dummy regressor.
Smoothing splines try to strike the appropriate bias-variance tradeoff. If you've got lots of data and relatively few categories in the categorical variables, there will be no real loss in efficiency for including all of the dummy regressors representing the categories and the bias will also be as small as possible. To the extent that the smoothing spline model is different from the one treating everything as factors, it is likely inducing bias without a corresponding increase in efficiency. If it were me, I would stick with a model that treats all of the categorical variables as factors.
First time asking in the forums, this time I couldn't find the solutions in other answers.
I'm just starting to learn to use R, so I can't help but think this has a simple solution I'm failing to see.
I'm analyzing the relationship between different insect species (SP) and temperature (T), explanatory variables
and the area of the femur of the resulting adult (Femur.area) response variable.
This is my linear model:
ModeloP <- lm(Femur.area ~ T * SP, data=Datos)
No error, but when I want to model variance with gls,
modelo_varPower <- gls(Femur.area ~ T*SP,
weights = varPower(),
data = Datos
)
I get the following errors...
Error in glsEstimate(object, control = control) :
computed "gls" fit is singular, rank 19
The linear model barely passes the Shapiro test of normality, could this be the issue?
Shapiro-Wilk normality test
data: re
W = 0.98269, p-value = 0.05936
Strangely I've run this model using another explanatory variable and had no errors, all I can read in the forums has to do with multiple samplings along a period of time, and thats not my case.
Since the only difference is the response variable I'm uploading and image of how the table looks like in case it helps.
You have some missing cells in your SP:T interaction. lm() tolerates these (if you look at coef(lm(Femur.area~SP*T,data=Datos)) you'll see some NA values for the missing interactions). gls() does not. One way to deal with this is to create an interaction variable and drop the missing levels, then fit the model as (effectively) a one-way rather than a two-way ANOVA. (I called the data dd rather than datos.)
dd3 <- transform(na.omit(dd), SPT=droplevels(interaction(SP,T)))
library(nlme)
gls(Femur.area~SPT,weights=varPower(form=~fitted(.)),data=dd3)
If you want the main effects and the interaction term and the power-law variance that's possible, but it's harder.
I am conducting an analysis of where on the landscape a predator encounters potential prey. My response data is binary with an Encounter location = 1 and a Random location = 0 and my independent variables are continuous but have been rescaled.
I originally used a GLM structure
glm_global <- glm(Encounter ~ Dist_water_cs+coverMN_cs+I(coverMN_cs^2)+
Prey_bio_stand_cs+Prey_freq_stand_cs+Dist_centre_cs,
data=Data_scaled, family=binomial)
but realized that this failed to account for potential spatial-autocorrelation in the data (a spline correlogram showed high residual correlation up to ~1000m).
Correlog_glm_global <- spline.correlog (x = Data_scaled[, "Y"],
y = Data_scaled[, "X"],
z = residuals(glm_global,
type = "pearson"), xmax = 1000)
I attempted to account for this by implementing a GLMM (in lme4) with the predator group as the random effect.
glmm_global <- glmer(Encounter ~ Dist_water_cs+coverMN_cs+I(coverMN_cs^2)+
Prey_bio_stand_cs+Prey_freq_stand_cs+Dist_centre_cs+(1|Group),
data=Data_scaled, family=binomial)
When comparing AIC of the global GLMM (1144.7) to the global GLM (1149.2) I get a Delta AIC value >2 which suggests that the GLMM fits the data better. However I am still getting essentially the same correlation in the residuals, as shown on the spline correlogram for the GLMM model).
Correlog_glmm_global <- spline.correlog (x = Data_scaled[, "Y"],
y = Data_scaled[, "X"],
z = residuals(glmm_global,
type = "pearson"), xmax = 10000)
I also tried explicitly including the Lat*Long of all the locations as an independent variable but results are the same.
After reading up on options, I tried running Generalized Estimating Equations (GEEs) in “geepack” thinking this would allow me more flexibility with regards to explicitly defining the correlation structure (as in GLS models for normally distributed response data) instead of being limited to compound symmetry (which is what we get with GLMM). However I realized that my data still demanded the use of compound symmetry (or “exchangeable” in geepack) since I didn’t have temporal sequence in the data. When I ran the global model
gee_global <- geeglm(Encounter ~ Dist_water_cs+coverMN_cs+I(coverMN_cs^2)+
Prey_bio_stand_cs+Prey_freq_stand_cs+Dist_centre_cs,
id=Pride, corstr="exchangeable", data=Data_scaled, family=binomial)
(using scaled or unscaled data made no difference so this is with scaled data for consistency)
suddenly none of my covariates were significant. However, being a novice with GEE modelling I don’t know a) if this is a valid approach for this data or b) whether this has even accounted for the residual autocorrelation that has been evident throughout.
I would be most appreciative for some constructive feedback as to 1) which direction to go once I realized that the GLMM model (with predator group as a random effect) still showed spatially autocorrelated Pearson residuals (up to ~1000m), 2) if indeed GEE models make sense at this point and 3) if I have missed something in my GEE modelling. Many thanks.
Taking the spatial autocorrelation into account in your model can be done is many ways. I will restrain my response to R main packages that deal with random effects.
First, you could go with the package nlme, and specify a correlation structure in your residuals (many are available : corGaus, corLin, CorSpher ...). You should try many of them and keep the best model. In this case the spatial autocorrelation in considered as continous and could be approximated by a global function.
Second, you could go with the package mgcv, and add a bivariate spline (spatial coordinates) to your model. This way, you could capture a spatial pattern and even map it. In a strict sens, this method doesn't take into account the spatial autocorrelation, but it may solve the problem. If the space is discret in your case, you could go with a random markov field smooth. This website is very helpfull to find some examples : https://www.fromthebottomoftheheap.net
Third, you could go with the package brms. This allows you to specify very complex models with other correlation structure in your residuals (CAR and SAR). The package use a bayesian approach.
I hope this help. Good luck
I'm using the package glmnet, I need to run several LASSO analysis for the calibration of a large number of variables (%reflectance for each wavelength throughout the spectrum) against one dependent variable. I have a couple of doubts on the procedure and on the results I wish to solve. I show my provisional code below:
First I split my data in training (70% of n) and testing sets.
smp_size <- floor(0.70 * nrow(mydata))
set.seed(123)
train_ind <- sample(seq_len(nrow(mydata)), size = smp_size)
train <- mydata[train_ind, ]
test <- mydata[-train_ind, ]
Then I separate the target trait (y) and the independent variables (x) for each set as follows:
vars.train <- train[3:2153]
vars.test <- test[3:2153]
x.train <- data.matrix(vars.train)
x.test <- data.matrix(vars.test)
y.train <- train$X1
y.test <- test$X1
Afterwords, I run a cross-validated LASSO model for the training set and extract and writte the non-zero coefficients for lambdamin. This is because one of my concerns here is to note which variables (wavebands of the reflectance spectrum) are selected by the model.
install.packages("glmnet")
library(glmnet)
cv.lasso.1 <- cv.glmnet(y=y.train, x= x.train, family="gaussian", nfolds =
5, standardize=TRUE, alpha=1)
coef(cv.lasso.1,s=cv.lasso.1$lambda.min) # Using lambda min.
(cv.lasso.1)
install.packages("broom")
library(broom)
c <- tidy(coef(cv.lasso.1, s="lambda.min"))
write.csv(c, file = "results")
Finally, I use the function “predict” and apply the object “cv.lasso1” (the model obtained previously) to the variables of the testing set (x.2) in order to get the prediction of the variable and I run the correlation between the predicted and the actual values of Y for the testing set.
predict.1.2 <- predict(cv.lasso.1, newx=x.2, type = "response", s =
"lambda.min")
cor.test(x=c(predict.1.2), y=c(y.2))
This is a simplified code and had no problem so far, the point is that I would like to make a loop (of one hundred repetitions) of the whole code and get the non-zero coefficients of the cross-validated model as well as the correlation coefficient of the predicted vs actual values (for the testing set) for each repetition. I've tried but couldn't get any clear results. Can someone give me some hint?
thanks!
In general, running repeated analyses of the same type over and over on the same data can be tricky. And in your case, may not be necessary the way in which you have outlined it.
If you are trying to find the variables most predictive, you can use PCA, Principal Component Analysis to select variables with the most variation within the a variable AND between variables, but it does not consider your outcome at all, so if you have poor model design it will pick the least correlated data in your repository but it may not be predictive. So you should be very aware of all variables in the set. This would be a way of reducing the dimensionality in your data for a linear or logistic regression of some sort.
You can read about it here
yourPCA <- prcomp(yourData,
center = TRUE,
scale. = TRUE)
Scaling and centering are essential to making these models work right, by removing the distance between your various variables setting means to 0 and standard deviations to 1. Unless you know what you are doing, I would leave those as they are. And if you have skewed or kurtotic data, you might need to address this prior to PCA. Run this ONLY on your predictors...keep your target/outcome variable out of the data set.
If you have a classification problem you are looking to resolve with much data, try an LDA, Linear Discriminant Analysis which looks to reduce variables by optimizing the variance of each predictor with respect to the OUTCOME variable...it specifically considers your outcome.
require(MASS)
yourLDA =r <- lda(formula = outcome ~ .,
data = yourdata)
You can also set the prior probabilities in LDA if you know what a global probability for each class is, or you can leave it out, and R/ lda will assign the probabilities of the actual classes from a training set. You can read about that here:
LDA from MASS package
So this gets you headed in the right direction for reducing the complexity of data via feature selection in a computationally solid method. In looking to build the most robust model via repeated model building, this is known as crossvalidation. There is a cv.glm method in boot package which can help you get this taken care of in a safe way.
You can use the following as a rough guide:
require(boot)
yourCVGLM<- cv.glmnet(y = outcomeVariable, x = allPredictorVariables, family="gaussian", K=100) .
Here K=100 specifies that you are creating 100 randomly sampled models from your current data OBSERVATIONS not variables.
So the process is two fold, reduce variables using one of the two methods above, then use cross validation to build a single model from repeated trials without cumbersome loops!
Read about cv.glm here
Try starting on page 41, but look over the whole thing. The repeated sampling you are after is called booting and it is powerful and available in many different model types.
Not as much code and you might hope for, but pointing you in a decent direction.