I need to adjust a logistic regression model for clustered standard errors. For this purpose I use the package clusterSEs and the command cluster.im.
I have two levels in the dataset Tbf2: individual and village.
Tbf2 is my small dataset consisting of the variables Burned (binary, village level), Village (factor, village level) and VoteForER2 (binary, individual level).
My code is provided below:
#Make sure the data has the same length,
Tbf1 <- data.frame(cbind(Burned, Village, VoteForER2))
Tbf2 <- na.omit(Tbf1)
#Prediction of support for Authorities on Burned
###ER2 ; logistic regression
fm <- glm(Tbf2$VoteForER2 ~ Tbf2$Burned + Tbf2$, family=binomial(link="logit"))
display(fm)
#Adjusted p-values
clust.p <- cluster.im(fm, Tbf2, Village, ci.level = 0.95, report = T, drop = FALSE)
My problem is that I keep getting the following error message:
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
I can't figure out how to solve this. As far as I can see, I have two different levels in the regression model.
I hope somebody will be able to help me!
Best,
Sofie
The cluster.im function works the following way:
Computes p-values and confidence intervals for GLM models based on cluster-specific model estimation (Ibragimov and Muller 2010). A separate model is estimated in each cluster, and then p-values and confidence intervals are computed based on a t/normal distribution of the cluster-specific estimates.
The model cannot be estimated within each cluster because the Burned variable is the same for everyone in a village: the whole village either burned or it did not. So the function asks for more data; in effect it says "give me at least 2 different levels of the predictor".
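To see the point on a small scale, here is a minimal base R sketch with made-up data (the village labels and vote pattern are hypothetical, and it uses plain glm rather than cluster.im). It only illustrates that a village-level predictor has a single value inside each village, so a per-village model has nothing to estimate its effect from.

```r
# Toy data (made up): 4 villages, 6 respondents each; Burned is set per village.
toy <- data.frame(
  Village    = factor(rep(c("A", "B", "C", "D"), each = 6)),
  Burned     = rep(c(1, 1, 0, 0), each = 6),
  VoteForER2 = rep(c(1, 0, 1, 1, 0, 0), times = 4)
)

# Number of distinct Burned values inside each village: always 1.
distinct_per_village <- tapply(toy$Burned, toy$Village,
                               function(x) length(unique(x)))
print(distinct_per_village)

# Fitting the model inside a single village cannot identify the Burned effect:
# the Burned column is collinear with the intercept, so its coefficient is NA.
fit_one <- glm(VoteForER2 ~ Burned, family = binomial(link = "logit"),
               data = subset(toy, Village == "A"))
print(coef(fit_one))
```

A check like the tapply() call above, run on the real data, would show whether any cluster has within-cluster variation in the predictor at all.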
I am doing a multilevel meta-analysis. Many studies have several subgroups. When I make a forest plot, the subgroups are shown as separate rows, 60 of them. I would rather plot one row per study, which would give 25 rows and be more appropriate. Does anyone have an idea how to make this forest plot?
I did it this way:
full.model <- rma.mv(yi = yi,
                     V = vi,
                     slab = Author,
                     data = df,
                     random = ~ 1 | Author/Study,
                     test = "t",
                     method = "REML")
forest(full.model)
It is not clear to me if you want to aggregate to the Author level or to the Study level. If there are multiple rows of data for particular studies, then the model isn't really complete and you would want to add another random intercept for the level of the estimates within studies. Essentially, the lowest random effect should have as many values for nlvls in the output as there are estimates (k).
Let's first tackle the case where we have a multilevel structure with two levels, studies and multiple estimates within studies (for some technical reasons, some might call this a three-level model, but let's not get into this). I will use a fully reproducible example for illustration purposes, using the dat.konstantopoulos2011 dataset, where we have districts and schools within districts. We fit a multilevel model of the type as you have with:
library(metafor)
dat <- dat.konstantopoulos2011
res <- rma.mv(yi, vi, random = ~ 1 | district/school, data=dat)
res
We can aggregate the estimates to the district level using the aggregate() function, specifying the marginal var-cov matrix of the estimates from the model to account for their non-independence (note that this makes use of aggregate.escalc() which only works with escalc objects, so if it is not, you need to convert the dataset to one - see help(aggregate.escalc) for details):
agg <- aggregate(dat, cluster=dat$district, V=vcov(res, type="obs"))
agg
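To make the aggregation step more concrete, here is a minimal base R sketch of an inverse-variance weighted average under a full var-cov matrix, which is my understanding of what the aggregation amounts to within one cluster. The estimates, variances and the shared component are made-up numbers, not values from dat.konstantopoulos2011, and the sketch does not call metafor.

```r
# Hypothetical estimates from one cluster and their sampling variances.
yi <- c(0.20, 0.40, 0.30)
vi <- c(0.04, 0.05, 0.06)

# Assumed marginal var-cov matrix: sampling variances on the diagonal plus a
# shared component added to every entry (this induces the covariances).
shared <- 0.02
V <- diag(vi) + shared

W <- solve(V)                       # weight matrix
var_agg <- 1 / sum(W)               # variance of the aggregated estimate
est_agg <- sum(W %*% yi) * var_agg  # weighted average of the estimates
print(c(estimate = est_agg, variance = var_agg))
```

With this particular structure the aggregated variance works out to the shared component plus 1 / sum(1 / vi), so the shared part does not shrink no matter how many estimates the cluster contains.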
You will find that if you then fit an equal-effects model to these estimates based on the aggregated data that the results are identical to what you obtained from the multilevel model (we use an equal-effects model since the heterogeneity accounted for by the multilevel model is already encapsulated in vcov(res, type="obs")):
rma(yi, vi, method="EE", data=agg)
So, we can now use these aggregated values in a forest plot:
with(agg, forest(yi, vi, slab=district))
My guess based on your description is that you actually have an additional level that you should include in the model and that you want to aggregate to the intermediate level. This is a tad more complicated, since aggregate() isn't meant for that. Just for illustration purposes, say we use year as another (higher) level and I will mess a bit with the data so that all three variance components are non-zero (again, just for illustration purposes):
dat$yi[dat$year == 1976] <- dat$yi[dat$year == 1976] + 0.8
res <- rma.mv(yi, vi, random = ~ 1 | year/district/school, data=dat)
res
Now instead of aggregate(), we can accomplish the same thing by using a multivariate model, including the intermediate level as a factor and using again vcov(res, type="obs") as the var-cov matrix:
agg <- rma.mv(yi, V=vcov(res, type="obs"), mods = ~ 0 + factor(district), data=dat)
agg
Now the model coefficients of this model are the aggregated values and the var-cov matrix of the model coefficients is the var-cov matrix of these aggregated values:
coef(agg)
vcov(agg)
They are not all independent (since we haven't aggregated to the highest level), so if we want to check that we can obtain the same results as from the multilevel model, we must account for this dependency:
rma.mv(coef(agg), V=vcov(agg), method="EE")
Again, exactly the same results. So now we use these coefficients and the diagonal from vcov(agg) as their sampling variances in the forest plot:
forest(coef(agg), diag(vcov(agg)), slab=names(coef(agg)))
The forest plot cannot indicate the dependency that still remains in these values, so if one were to meta-analyze these aggregated values using only diag(vcov(agg)) as their sampling variances, the results would not be identical to what you get from the full multilevel model. But there isn't really a way around that and the plot is just a visualization of the aggregated estimates and the CIs shown are correct.
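The effect of ignoring the remaining dependency can be sketched in base R with made-up numbers (the three estimates and their var-cov matrix below are hypothetical, not output from the model above). The same equal-effects pooling is computed once with the full matrix and once with only its diagonal.

```r
# Hypothetical aggregated estimates; the first two are correlated.
b <- c(0.10, 0.50, 0.30)
V <- matrix(c(0.04, 0.03, 0.00,
              0.03, 0.05, 0.00,
              0.00, 0.00, 0.06), nrow = 3, byrow = TRUE)

# Equal-effects (generalized least squares) pooled estimate for a given V.
pool <- function(b, V) {
  W <- solve(V)
  sum(W %*% b) / sum(W)
}

full_V    <- pool(b, V)               # accounts for the covariance
diag_only <- pool(b, diag(diag(V)))   # treats the estimates as independent
print(c(full_V = full_V, diag_only = diag_only))
```

The two pooled values differ, which is the reason the forest plot should be read as a display of the aggregated estimates rather than as the input to a new analysis.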
You need to specify your own grouping in a new column of data and use this as the new random effect:
df$study_group <- c(1,1,1,2,2,3,4,5,5,5) # example
full.model <- rma.mv(yi = yi,
                     V = vi,
                     slab = Author,
                     data = df,
                     random = ~ 1 | study_group,
                     test = "t",
                     method = "REML")
forest(full.model)
I have panel data on the external assets of 102 countries over roughly 20-40 years, depending on the country.
I tried predicting the probability of a financial crisis from log(total_liabilities) to see whether an increase in foreign investment and other capital positions can help predict a crisis.
plm1 <- plm(crisis ~ log_total_liabilities + lag1_log_tot_lia + lag2_log_tot_lia + lag3_log_tot_lia
+ factor(year) + factor(country), data = dt2, index=c("year", "country"), model="pooling")
summary(plm1)
I started by estimating a plm model, regressing on my crisis dummy.
To estimate the predictive ability, I wanted to generate a ROC curve and an AUC value from the regression:
# Plot of True Positive Rate Against the False Positive Rate
pred1 <- predict(plm1)
pred2 <- prediction(pred1,as.numeric(plm1$crisis))
plot(performance(pred2,"tpr","fpr"), las=0, main="plm1")
I get errors like:
"Error: not fitting arguments / variables" (translated from German) or "all arguments/variables need to have the same length" (translated from German).
When I instead obtain the predictions with pred1 <- predict(plm1, dt2) (dt2 is my data frame, which also contains some variables I had not used in the plm1 regression), the error differs:
The format of predictions is invalid. It couldn't be coerced to a list.
Are PLMs simply not made for ROC calculations? And if so, how come the attached paper presents AUROC values for a linear probability model with fixed effects? (See the second-to-last row.)
And if not, what am I doing wrong?
I attached the screenshot of the paper and my dataset.
CSV file with dataset
Screenshot of paper with OLS AUROC value
AUC-ROC only works for binary classification problems. Because you used a fixed effects regression, the predicted values stored in pred1 are continuous.
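For reference, here is a minimal base R sketch of the rank-based (Mann-Whitney) form of the AUC, assuming that what the calculation needs is one numeric score per observation and a 0/1 label vector of the same length. The helper auc_rank and the toy numbers are hypothetical and not taken from the question. It may also be worth checking the label argument in the question: as far as I know a plm fit has no crisis component, so plm1$crisis could be empty, which would explain a length mismatch.

```r
# Rank-based AUC: probability that a randomly chosen positive case has a
# higher score than a randomly chosen negative case (ties get average ranks).
auc_rank <- function(score, label) {
  stopifnot(length(score) == length(label))
  r  <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy example: continuous scores with binary outcomes.
score <- c(0.10, 0.40, 0.35, 0.80)
label <- c(0, 0, 1, 1)
print(auc_rank(score, label))  # 0.75
```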
I am trying to create a GLMM in R. I want to find out how the emergence time of bats depends on different factors. Here I take the time difference between the departure of the respective bat and the sunset of the day as dependent variable (metric). As fixed factors I would like to include different weather data (metric) as well as the reproductive state (categorical) of the bats. Additionally, there is the transponder number (individual identification code) as a random factor to exclude inter-individual differences between the bats.
I first worked in R with a linear mixed model (package lme4), but the QQ plot of the residuals deviates very strongly from the normal distribution. Also a histogram of the data rather indicates a gamma distribution. As a result, I implemented a GLMM with a gamma distribution. Here is an example with one weather parameter:
model <- glmer(formula = difference_in_min ~ repro + precipitation + (1+repro|`transponder number`), data = trip, control=ctrl, family=Gamma(link = log))
However, since there was no change in the QQ plot this way, I looked at the residual diagnostics of the DHARMa package. But the distribution assumption still doesn't seem to be correct, because the data in the QQ plot deviates very much here, too.
Residual diagnostics from DHARMa
But if the data also do not correspond to a gamma distribution, what alternative is there? Or maybe the problem lies somewhere else entirely.
Does anyone have an idea where the error might lie?
But if the data also do not correspond to a gamma distribution, what alternative is there?
One alternative is called the lognormal distribution (https://en.wikipedia.org/wiki/Log-normal_distribution)
Gaussian (or normal) models assume residuals that are symmetric around zero, which it sounds like you do not have. The lognormal distribution does not have the same requirement. Following your previous code, you would fit it like this:
model <- glmer(formula = log(difference_in_min) ~ repro + precipitation + (1+repro|`transponder number`), data = trip, control=ctrl, family=gaussian(link = identity))
or instead of glmer you can just call lmer directly, where you don't need to specify the distribution (which it may tell you to do in a warning message anyway):
model <- lmer(formula = log(difference_in_min) ~ repro + precipitation + (1+repro|`transponder number`), data = trip, control=ctrl)
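The idea behind the log transform can be sketched in base R with simulated data (the sample size and parameters below are arbitrary assumptions, and the model has no random effects). If a variable is lognormal, its logarithm is normal, so an ordinary linear model on the log scale recovers the log-scale mean. Note that log() needs strictly positive values, so any zero or negative time differences would have to be dealt with first.

```r
set.seed(1)

# Simulated positive, right-skewed response (lognormal by construction).
y <- rlnorm(200, meanlog = 3, sdlog = 0.5)

# A linear model on the log scale; the intercept estimates the log-scale mean.
fit <- lm(log(y) ~ 1)
print(coef(fit))

# Back-transforming the intercept gives an estimate of the median of y,
# not of its mean.
print(exp(coef(fit)))
```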
I have a df based on a national survey conducted every two years; the time period is 2010-14, and I filtered the df to keep only persons who appear at least two times. This gives me a panel df, but an unbalanced one.
I ran a regression to study which variables influence participation in complementary pension schemes (it is voluntary in my country). I ran a one-way fixed effects regression and now I want to run a two-way fixed effects regression (both individual and time).
The individual variable is uid and the time variable is year. I used the plm package in R:
df.p <- plm.data(df, c("uid", "year"))
and run the regression:
reg1 <- plm(pens ~ woman + age + I(age^2/100) + high + medium + nord + centre, model="within", effect="twoways", data=df.p)
where high and medium are dummies regarding the education level and nord and centre regard geographic location. For the sake of simplicity I omitted other variables that are present in the original model (20 variables).
After at least 1 hour of working I run the summary command:
summary(reg1)
after another hour of working I got the error:
Error in crossprod(t(X), beta) : non-conformable arguments
so I supposed there was a multicollinearity problem. I checked for multicollinearity with the correlation matrix:
p1 <- with(df, data.frame(woman=woman, age=age, high=high, medium=medium, nord=nord, centre=centre))
round(cor(p1),3)
Note that I created the matrix using all the variables (omitted here for the sake of simplicity, as I wrote). I didn't find any relevant value. I also checked the variance inflation factor:
vif(p1)
and I got:
No variable from the 20 input variables has collinearity problem.
At this point I suppose that the collinearity problem could be caused by the fact that I run a two-way regression, but I don't know how to handle the problem.
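One thing a correlation matrix or VIF on the raw variables cannot show is what the fixed effects transformation does to each regressor. The base R sketch below uses a small balanced toy panel (made-up values; the real data are unbalanced, so this is only an illustration of the mechanism and not a diagnosis). It demeans a time-invariant dummy by individual and applies the two-way transformation to an age variable that grows in step with the survey year.

```r
# Balanced toy panel: 3 persons observed in 3 survey waves.
toy <- data.frame(
  uid   = rep(1:3, each = 3),
  year  = rep(c(2010, 2012, 2014), times = 3),
  woman = rep(c(1, 0, 1), each = 3),                  # constant within person
  age   = rep(c(30, 45, 52), each = 3) + rep(c(0, 2, 4), times = 3)
)

# Individual demeaning wipes out anything that does not vary within a person.
woman_within <- toy$woman - ave(toy$woman, toy$uid)

# Two-way transformation (balanced case): subtract person and year means,
# add back the grand mean. Age moves one-for-one with year, so it vanishes.
age_twoway <- toy$age - ave(toy$age, toy$uid) -
  ave(toy$age, toy$year) + mean(toy$age)

print(woman_within)
print(age_twoway)
print(cor(toy$woman, toy$age))  # the raw correlation looks unremarkable
```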
Thanks in advance.
I have panel data including income for individuals over years, and I am interested in the income trends of individuals, i.e. individual coefficients for income over years, and residuals for each individual for each year (the unexpected changes in income according to my model). However, many observations have missing income data for one or more years, so with a linear regression I lose the majority of my observations. The data structure is like this:
caseid<-c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4)
years<-c(1998,2000,2002,2004,2006,2008,1998,2000,2002,2004,2006,2008,
1998,2000,2002,2004,2006,2008,1998,2000,2002,2004,2006,2008)
income<-c(1100,NA,NA,NA,NA,1300,1500,1900,2000,NA,2200,NA,
NA,NA,NA,NA,NA,NA, 2300,2500,2000,1800,NA, 1900)
df<-data.frame(caseid, years, income)
I decided to use a random effects model, which I think will still predict income for missing years by using a maximum likelihood approach. However, since the Hausman test gives a significant result, I decided to use a fixed effects model. I ran the code below, using the plm package:
inc.fe<-plm(income~years, data=df, model="within", effect="individual")
However, I get coefficients only for years and not for individuals; and I cannot get residuals.
To maybe give an idea, the code in Stata should be
xtest caseid
xtest income year
predict resid, resid
Then I tried to run the pvcm function from the same library, which is a function for variable coefficients.
inc.wi<-pvcm(Income~Year, data=ldf, model="within", effect="individual")
However, I get the following error message:
"Error in FUN(X[[i]], ...) : insufficient number of observations".
How can I get individual coefficients and residuals with pvcm by resolving this error or by using some other function?
My original long form data has 202976 observations and 15 years.
Does the fixef function from package plm give you what you are looking for?
Continuing your example:
fixef(inc.fe)
Residuals are extracted by:
residuals(inc.fe)
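As a cross-check that does not depend on plm, the same kind of individual intercepts and residuals can be sketched with a dummy-variable (LSDV) regression in base R, using the example data from the question. This is an illustration under the assumption that a common slope for years is wanted; na.action = na.exclude is used so that the residual vector keeps one entry per row, with NA where income is missing.

```r
caseid <- c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4)
years  <- c(1998,2000,2002,2004,2006,2008,1998,2000,2002,2004,2006,2008,
            1998,2000,2002,2004,2006,2008,1998,2000,2002,2004,2006,2008)
income <- c(1100,NA,NA,NA,NA,1300,1500,1900,2000,NA,2200,NA,
            NA,NA,NA,NA,NA,NA,2300,2500,2000,1800,NA,1900)
df <- data.frame(caseid, years, income)

# One intercept per individual plus a common slope for years.
# Individuals with no observed income (caseid 3) drop out of the fit.
fit <- lm(income ~ 0 + factor(caseid) + years, data = df,
          na.action = na.exclude)
print(coef(fit))

# Residuals aligned with the rows of df; NA where income is missing.
df$resid <- residuals(fit)
print(df)
```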
You have a random effects model with random slopes and intercepts. This is also known as a random coefficients regression model. The missingness is the tricky part, which (I'm guessing) you'll have to write custom code to solve after you choose how you wish to do so.
But you haven't clearly/properly specified your model (at least in your question) as far as I can tell. Let's define some terms:
Let Y_it = income for individual i (i = 1, ..., N) in year t (t = 1, ..., T). As I read your question, you have not specified which of the two models below you wish to have:
M1: random intercepts, global slope, random slopes
Y_it ~ N(\mu_i + B T + \gamma_i I T, \sigma^2)
\mu_i ~ N(\phi_0, \tau_0^2)
\gamma_i ~ N(\phi_1, \tau_1^2)
M2: random intercepts, random slopes
Y_it ~ N(\mu_i + \gamma_i I T, \sigma^2)
\mu_i ~ N(\phi_0, \tau_0^2)
\gamma_i ~ N(\phi_1, \tau_1^2)
Also, your example data is nonsensical (see below). As you can see, you don't have enough observations to estimate all parameters. I'm not familiar with library(plm) but the above models (without missingness) can be estimated in lme4 easily. Without a realistic example dataset, I won't bother providing code.
R> table(df$caseid, is.na(df$income))
    FALSE TRUE
  1     2    4
  2     4    2
  3     0    6
  4     5    1
Given that you do have missingness, you should be able to produce estimates for either hierarchical model via the typical methods, such as EM. But I do think you'll have to write the code to do the estimation yourself.