Error when trying to run fixed effects logistic regression - r

not sure where can I get help, since this exact post was considered off-topic on StackExchange.
I want to run some regressions based on a balanced panel with electoral data from Brazil focusing on 2 time periods. I want to understand if after a change in legislation that prohibited firm donations to candidates, those individuals that depended most on these resources had a lower probability of getting elected.
I have already ran a regression like this on R:
model_continuous <- plm(percentage_of_votes ~ time +
treatment + time*treatment, data = dataset, model = 'fd')
On this model I have used a continuous variable (% of votes) as my dependent variable. My treatment units or those that in time = 0 had no campaign contributions coming from corporations.
Now I want to change my dependent variable so that it is a binary variable indicating if the candidate was elected on that year. All of my units were elected on time = 0. How can I estimate a logit or probit model using fixed effects? I have tried using the pglm package in R.
model_binary <- pglm(dummy_elected ~ time + treatment + time*treatment,
data = dataset,
effects = 'twoways',
model = 'within',
family = 'binomial',
start = NULL)
However, I got this error:
Error in maxRoutine(fn = logLik, grad = grad, hess = hess, start = start, :
argument "start" is missing, with no default
Why is that happening? What is wrong with my model? Is it conceptually correct?
I want the second regression to be as similar as possible to the first one.
I have read that clogit function from the survival package could do the job, but I dont know how to do it.
Edit:
this is what a sample dataset could look like:
dataset <- data.frame(individual = c(1,1,2,2,3,3,4,4,5,5),
time = c(0,1,0,1,0,1,0,1,0,1),
treatment = c(0,0,1,1,0,0,1,1,0,0),
corporate = c(0,0,0.1,0,0,0,0.5,0,0,0))

Based on the comments, I believe the logistic regression reduces to treatment and dummy_elected. Accordingly I have fabricated the following dataset:
dataset <- data.frame("treatment" = c(rep(1,1000),rep(0,1000)),
"dummy_elected" = c(rep(1, 700), rep(0, 300), rep(1, 500), rep(0, 500)))
I then ran the GLM model:
library(MASS)
model_binary <- glm(dummy_elected ~ treatment, family = binomial(), data = dataset)
summary(model_binary)
Note that the treatment coefficient is significant and the coefficients are given. The resulting probabilities are thus
Probability(dummy_elected) = 1 => 1 / (1 + Exp(-(1.37674342264577E-16 + 0.847297860386033 * :treatment)))
Probability(dummy_elected) = 0 => 1 - 1 / (1 + Exp(-(1.37674342264577E-16 + 0.847297860386033 * :treatment)))
Note that these probabilities are consistent with the frequencies I generated the data.
So for each row, take the max probability across the two equations above and that's the value for dummy_elected.

Related

Non-parametric bootstrapping to generate 95% Confidence Intervals for fixed effect coefficients calculated by a glmer with nested random effects

I have an R coding question.
This is my first time asking a question here, so apologies if I am unclear or do something wrong.
I am trying to use a Generalized Linear Mixed Model (GLMM) with Poisson error family to test for any significant effect on a count response variable by three separate dichotomous variables (AGE = ADULT or JUVENILE, SEX = MALE or FEMALE and MEDICATION = NEW or OLD) and an interaction between AGE and MEDICATION (AGE:MEDICATION).
There is some dependency in my data in that the data was collected from a total of 22 different sites (coded as SITE vector with 33 distinct levels), and the data was collected over a total of 21 different years (coded as YEAR vector with 21 distinct levels, and treated as a categorical variable). Unfortunately, every SITE was not sampled for each YEAR, with some being sampled for a greater number of years than others.
The data is also quite sparse, in that I do not have a great number of measurements of the response variable (coded as COUNT and an integer vector) per SITE per YEAR.
My Poisson GLMM is constructed using the following code:
model <- glmer(data = mydata,
family = poisson(link = "log"),
formula = COUNT ~ SEX + SEX:MEDICATION + AGE + AGE:SEX + MEDICATION + AGE:MEDICATION + (1|SITE/YEAR),
offset = log(COUNT.SAMPLE.SIZE),
nAGQ = 0)
In order to try and obtain more reliable estimates for the fixed effect coefficients (particularly given the sparse nature of my data), I am trying to obtain 95% confidence intervals for the fixed effect coefficients through non-parametric bootstrapping.
I have come across the "glmmboot" package which can be used to conduct non-parametric bootstrapping of GLMMs, however when I try to run the non-parametric bootstrapping using the following code:
library(glmmboot)
bootstrap_model(base_model = model,
base_data = mydata,
resamples = 1000)
When I run this code, I receive the following message:
Performing case resampling (no random effects)
Naturally, though, my model does have random effects, namely (1|SITE/YEAR).
If I try to tell the function to resample from a specific block, by adding in the "reample_specific_blocks" argument, i.e.:
library(glmmboot)
bootstrap_model(base_model = model,
base_data = mydata,
resamples = 1000,
resample_specific_blocks = "YEAR")
Then I get the following error message:
Performing block resampling, over SITE
Error: Invalid grouping factor specification, YEAR:SITE
I get a similar error message if I try set 'resample_specific_blocks' to "SITE".
If I then try to set 'resample_specific_blocks' to "SITE:YEAR" or "SITE/YEAR" I get the following error message:
Error in bootstrap_model(base_model = model, base_data = mydata, resamples = 1000, :
No random columns from formula found in resample_specific_blocks
I have tried explicitly nesting YEAR within SITE and then adapting the model accordingly using the code:
mydata <- within(mydata, SAMPLE <- factor(SITE:YEAR))
model.refit <- glmer(data = mydata,
family = poisson(link = "log"),
formula = COUNT ~ SEX + AGE + MEDICATION + AGE:MEDICATION + (1|SITE) + (1|SAMPLE),
offset = log(COUNT.SAMPLE.SIZE),
nAGQ = 0)
bootstrap_model(base_model = model.refit,
base_data = mydata,
resamples = 1000,
resample_specific_blocks = "SAMPLE")
But unfortunately I just get this error message:
Error: Invalid grouping factor specification, SITE
The same error message comes up if I set resample_specific_blocks argument to SITE, or if I just remove the resample_specific_blocks argument.
I believe that the case_bootstrap() function found in the lmeresampler package could potentially be another option, but when I look into the help for it it looks like I would need to create a function and I unfortunately have no experience with creating my own functions within R.
If anyone has any suggestions on how I can get the bootstrap_model() function in the glmmboot package to recognise the random effects in my model/dataframe, or any suggestions for alternative methods on conducting non-parametric bootstrapping to create 95% confidence intervals for the coefficients of the fixed effects in my model, it would be greatly appreciated! Many thanks in advance, and for reading such a lengthy question!
For reference, I include links to the RDocumentation and GitHub for the glmmboot package:
https://www.rdocumentation.org/packages/glmmboot/versions/0.6.0
https://github.com/ColmanHumphrey/glmmboot
The following is code that will allow for creation of a reproducible example using the data set from lme4::grouseticks
#Load in required packages
library(tidyverse)
library(lme4)
library(glmmboot)
library(psych)
#Load in the grouseticks dataframe
data("grouseticks")
tibble(grouseticks)
#Create dummy vectors for SEX, AGE and MEDICATION
set.seed(1)
SEX <-sample(1:2, size = 403, replace = TRUE)
SEX <- as.factor(ifelse(SEX == 1, "MALE", "FEMALE"))
set.seed(2)
AGE <- sample(1:2, size = 403, replace = TRUE)
AGE <- as.factor(ifelse(AGE == 1, "ADULT", "JUVENILE"))
set.seed(3)
MEDICATION <- sample(1:2, size = 403, replace = TRUE)
MEDICATION <- as.factor(ifelse(MEDICATION == 1, "OLD", "NEW"))
grouseticks$SEX <- SEX
grouseticks$AGE <- AGE
grouseticks$MEDICATION <- MEDICATION
#Use the INDEX vector to create a vector of sample sizes per LOCATION
#per YEAR
grouseticks$INDEX <- 1
sample.sizes <- grouseticks %>%
group_by(LOCATION, YEAR) %>%
summarise(SAMPLE.SIZE = sum(INDEX))
#Combine the dataframes together into the dataframe to be used in the
#model
mydata$SAMPLE.SIZE <- as.integer(mydata$SAMPLE.SIZE)
#Create the Poisson GLMM model
model <- glmer(data = mydata,
family = poisson(link = "log"),
formula = TICKS ~ SEX + SEX + AGE + MEDICATION + AGE:MEDICATION + (1|LOCATION/YEAR),
nAGQ = 0)
#Attempt non-parametric bootstrapping on the model to get 95%
#confidence intervals for the coefficients of the fixed effects
set.seed(1)
Model.bootstrap <- bootstrap_model(base_model = model,
base_data = mydata,
resamples = 1000)
Model.bootstrap

Calculate indirect effect of 1-1-1 (within-person, multilevel) mediation analyses

I have data from an Experience Sampling Study, which consists of 8140 observations nested in 106 participants. I want to test if there is a mediation, in which I also want to compare the predictors (X1= socialInteraction_tech, X2= socialInteraction_ftf, M = MPEE_int, Y= wellbeing). X1, X2, and M are person-mean centred in order to obtain the within-person effects. To account for the autocorrelation I have fit a model with an ARMA(2,1) structure. We control for time with the variable "obs".
This is the final model including all variables of interest:
fit_mainH1xmy <- lme(fixed = wellbeing ~ 1 + obs # Controls
+ MPEE_int_centred + socialInteraction_tech_centred + socialInteraction_ftf_centred,
random = ~ 1 + obs | ID, correlation = corARMA(form = ~ obs | ID, p = 2, q = 1),
data = file, method = "ML", na.action=na.exclude)
summary(fit_mainH1xmy)
The mediation is partial, as my predictor X still significantly predicts Y after adding M.
However, I can't find a way to calculate c'(cprime), the indirect effect.
I have found the mlma package, but it looks weird and requires me to do transformations to my data.
I have tried melting the data in a long format and using lmer() to fit the model (following https://quantdev.ssri.psu.edu/sites/qdev/files/ILD_Ch07_2017_Within-PersonMedationWithMLM.html), but lmer() does not let me take into account the moving average (MA-part of the ARMA(2,1) structure).
Does anyone know how I could now obtain the indirect effect?

Predict out of sample on fixed effects model

Let's consider model :
library(plm)
data("Produc", package = "plm")
model <- plm(pcap ~ hwy + water, data = Produc, model = 'within')
To calculate fitted value of the model we just need to use :
predict(model)
However, when trying to do this out of sample :
predict(model, newdata = data.frame('hwy' = 1, 'water' = 1))
Will get error :
Error in crossprod(beta, t(X)) : non-conformable arguments
Which is quite strange for me because this code will work for any model expect 'within'. I search that there is a function fixef which do predictions on fixed effect model but unfortunately - only in sample.
So : Is there any solution how can we predict out of sample on fixed effect model ?
Just delete intercept for model :
model <- plm(pcap ~ 0 + hwy + water, data = Produc, model = 'within')
predict(model, newdata = data.frame('hwy' = 1, 'water' = 1))
3.980911
Regarding out-of-sample prediction with fixed effects models, it is not clear how data relating to fixed effects not in the original model are to be treated, e.g., data for an individual not contained in the orignal data set the model was estimated on. (This is rather a methodological question than a programming question).
The version 2.6-2 of plm now allows predict for fixed effect models with the original data and with out-of-sample data (see ?predict.plm).
Find below an example with 10 firms for model estimation and the data to be used for prediction contains a firm not contained in the original data set (besides that firm, there are also years not contained in the original model object but these are irrelevant here as it is a one-way individual model). It is unclear what the fixed effect of that out-of-sample firm would be. Hence, by default, no predicted value is given (NA value). If argument na.fill is set to TRUE, the (weighted) mean of the fixed effects contained in the original model object is used as a best guess.
library(plm)
data("Grunfeld", package = "plm")
# fit a fixed effect model
fit.fe <- plm(inv ~ value + capital, data = Grunfeld, model = "within")
# generate 55 new observations of three firms used for prediction:
# * firm 1 with years 1935:1964 (has out-of-sample years 1955:1964),
# * firm 2 with years 1935:1949 (all in sample),
# * firm 11 with years 1935:1944 (firm 11 is out-of-sample)
set.seed(42L)
new.value2 <- runif(55, min = min(Grunfeld$value), max = max(Grunfeld$value))
new.capital2 <- runif(55, min = min(Grunfeld$capital), max = max(Grunfeld$capital))
newdata <- data.frame(firm = c(rep(1, 30), rep(2, 15), rep(11, 10)),
year = c(1935:(1935+29), 1935:(1935+14), 1935:(1935+9)),
value = new.value2, capital = new.capital2)
# make pdata.frame
newdata.p <- pdata.frame(newdata, index = c("firm", "year"))
## predict from fixed effect model with new data as pdata.frame
predict(fit.fe, newdata = newdata.p) # has NA values for the 11'th firm
## set na.fill = TRUE to have the weighted mean used to for fixed effects -> no NA values
predict(fit.fe, newdata = newdata.p, na.fill = TRUE)
NB: When you input a plain data.frame as newdata, it is not clear how the data related to the individuals and time periods, which is why the weighted mean of fixed effects from the original model object is used for all observations in newdata and a warning is printed. For fixed effect model prediction, it is reasonable to assume the user can provide information (via a pdata.frame) how the data the user wants to use for prediction relates to the individual and time dimension of panel data.

How can I make logistic model with this data?

http://www.statsci.org/data/oz/snails.txt
You can get data from here.
My data is 4*3*3*2 completely randomized design experiment data. I want to model the probability of survival in terms of the stimulus variables.
I tried ANOVA, but I'm not sure whether it's right or not.
Because I want to model the "probability", should I use logistic model??
(I also tried logistic model. But the data shows the sum of 0(Survived) and 1(Deaths). Even though it is not 0 and 1, can I use logistic??)
I want to put "probability" as Y variable.
So I used logit but it's not working.
The program says that y is Inf.
How can I use logit as Y variable in aov?
glm_a <- glm(Deaths ~ Exposure + Rel.Hum + Temp + Species, data = data,
family = binomial)
prob <- Deaths / 20
logitt <- log(prob / (1 - prob))
logmodel <- lm(logitt ~ data$Species + data$Exposure + data$Rel.Hum + data$Temp)
summary(logmodel)
A <- factor(data$Species, levels = c("A", "B"), labels = c(-1, 1))
glm_a <- glm(Y ~ data$Species * data$Exposure * data$Rel.Hum * data$Temp,
data=data, family = binomial)
summary(glm_a)
help("glm") should direct you to help("family"), which reveals the following
For the binomial and quasibinomial families the response can be specified in one of three ways:
As a factor: ‘success’ is interpreted as the factor not having the first level (and hence usually of having the second level).
As a numerical vector with values between 0 and 1, interpreted as the proportion of successful cases (with the total number of cases given by the weights).
As a two-column integer matrix: the first column gives the number of successes and the second the number of failures.
So for the question "How can I make logistic model with this data?", we can go with route #3 quite easily:
data <- read.table("http://www.statsci.org/data/oz/snails.txt", header = TRUE)
glm_a <- glm(cbind(Deaths, N - Deaths) ~ Species * Exposure * Rel.Hum * Temp,
data = data, family = binomial)
summary(glm_a)
# [output omitted]
As for the question "I tried ANOVA, but I'm not sure whether it's right or not. Because I want to model the "probability", should I use logistic model?", it's better to ask on Cross Validated

Using CARET together with GAM ("gamSpline" method) in R Poisson Regression

I am trying to use caret package to tune 'df' parameter of a gam model for my cohort analysis.
With the following data:
cohort = 1:60
age = 1:26
grid = data.frame(expand.grid(age = age, cohort = cohort))
size = data.frame(cohort = cohort, N = sample(100:150,length(cohort), replace = TRUE))
df = merge(grid, size, by = "cohort")
log_k = -3 + log(df$N) - 0.5*log(df$age) + df$cohort*(df$cohort-30)*(df$cohort-50)/20000 + runif(nrow(df),min = 0, max = 0.5)
df$conversion = rpois(nrow(df),exp(log_k))
Explanation of the data : Cohort number is the time of arrival of the potential customer. N is the number of potential customer that arrived at that time. Conversion is the number out of those potential customer that 'converted' (bought something). Age is the age (time spent from arrival) of the cohort when conversion took place. For a given cohort there are fewer conversions as age grows. This effect follows a power law.
But the total conversion rate of each cohort can also change slowly in time (cohort number). Thus I want a smoothing spline of the time variable in my model.
I can fit a gam model from package gam
library(gam)
fit = gam(conversion ~ log(N) + log(age) + s(cohort, df = 4), data = df, family = poisson)
fit
> Call:
> gam(formula = conversion ~ log(N) + log(age) + s(cohort, df = 4),
> family = poisson, data = df)
> Degrees of Freedom: 1559 total; 1553 Residual
> Residual Deviance: 1869.943
But if i try to train the model using the CARET package
library(caret)
fitControl = trainControl(verboseIter = TRUE)
fit.crt = train(conversion ~ log(N) + log(age) + s(cohort,df),
data = df, method = "gamSpline",
trControl = fitControl, tune.length = 3, family = poisson)
I get this error :
+ Resample01: df=1
model fit failed for Resample01: df=1 Error in as.matrix(x) : object 'N' not found
- Resample01: df=1
+ Resample01: df=2
model fit failed for Resample01: df=2 Error in as.matrix(x) : object 'N' not found .....
Please does anyone know what I'm doing wrong here?
Thanks
There are a two things wrong with your code.
The train function can be a bit tedious depending on the method you used (as you have noticed). In the case of method = "gamSpline", the train function adds a smooth term to every independent term in the formula. So it converts your variables to s(log(N), df), s(log(age) df) and to s(s(cohort, df), df).
Wait s(s(cohort, df), df) does not really makes sense. So you must change s(cohort, df) to cohort.
I am not sure why, but the train with method = "gamSpline" does not like it when you put functions (e.g. log) in the formula. I think this is due to the fact that this method already applies the s() functions to your variables. This problem can be solved by applying the log earlier to your variables. Such as df$N <- log(df$N) or logN <- log(df$N) and use logN as variable. And of course, do the same for age.
My guess is that you don't want this method to apply a smoothing term to all your independent variables based on the code you provided. I am not sure if this is possible and how to do it, if it is possible.
Hope this helps.
EDIT: If you want a more elegant solution than the one I provided at point 2, make sure to read the comment of #topepo. This suggestion also allows you to apply s() function to the variables you want if I understand it correctly.

Resources