How can I make a logistic model with this data in R?

http://www.statsci.org/data/oz/snails.txt
You can get the data from here.
My data come from a 4×3×3×2 completely randomized design experiment. I want to model the probability of survival in terms of the stimulus variables.
I tried ANOVA, but I'm not sure whether it's right or not. Because I want to model the "probability", should I use a logistic model?
(I also tried a logistic model, but each row of the data records the number of deaths out of a group of 20 snails, not individual 0/1 outcomes. Even though the response is not 0 and 1, can I still use logistic regression?)
I want to put the "probability" as the Y variable, so I used the logit, but it's not working: R says that y is Inf (the logit of a proportion equal to 0 or 1 is infinite). How can I use the logit as the Y variable in aov?
glm_a <- glm(Deaths ~ Exposure + Rel.Hum + Temp + Species,
             data = data, family = binomial)

prob <- data$Deaths / 20              # proportion of deaths out of 20 snails
logitt <- log(prob / (1 - prob))      # Inf or -Inf whenever prob is 1 or 0
logmodel <- lm(logitt ~ Species + Exposure + Rel.Hum + Temp, data = data)
summary(logmodel)

A <- factor(data$Species, levels = c("A", "B"), labels = c(-1, 1))
glm_a <- glm(Y ~ Species * Exposure * Rel.Hum * Temp,   # Y (the logit) is where it breaks
             data = data, family = binomial)
summary(glm_a)

help("glm") should direct you to help("family"), which reveals the following
For the binomial and quasibinomial families the response can be specified in one of three ways:
As a factor: ‘success’ is interpreted as the factor not having the first level (and hence usually of having the second level).
As a numerical vector with values between 0 and 1, interpreted as the proportion of successful cases (with the total number of cases given by the weights).
As a two-column integer matrix: the first column gives the number of successes and the second the number of failures.
So for the question "How can I make a logistic model with this data?", we can go with route #3 quite easily:
data <- read.table("http://www.statsci.org/data/oz/snails.txt", header = TRUE)
glm_a <- glm(cbind(Deaths, N - Deaths) ~ Species * Exposure * Rel.Hum * Temp,
             data = data, family = binomial)
summary(glm_a)
# [output omitted]
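For completeness, route #2 (a proportion response with the group sizes passed as weights) should give the same fit; a minimal sketch, assuming the N column holds the group sizes as in the file above:
# Route #2: proportion of deaths as the response, group sizes as weights
glm_b <- glm(Deaths / N ~ Species * Exposure * Rel.Hum * Temp,
             data = data, family = binomial, weights = N)
summary(glm_b)   # same coefficients as glm_a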
As for the question "I tried ANOVA, but I'm not sure whether it's right or not. Because I want to model the 'probability', should I use a logistic model?", that is a statistics question rather than a programming one, so it's better asked on Cross Validated.

Related

Simulating logistic regression from saved estimates in R

I have a bit of an issue. I am trying to develop some code that will allow me to do the following: 1) run a logistic regression analysis, 2) extract the estimates from the logistic regression analysis, and 3) use those estimates to create another logistic regression formula that I can use in a subsequent simulation of the original model.

As I am relatively new to R, I understand I can extract these coefficients one by one through indexing, but it is difficult to scale this to models with different numbers of coefficients. I am wondering if there is a better way to extract the coefficients and set up the formula. Then I would have to develop the actual variables, and the development of these variables would have to be flexible enough for any number of variables and distributions. This appears to be easily done in Mplus (example 12.7 in the Mplus manual), but I haven't figured it out in R. Here is the code for as far as I have gotten:
# generating the data
set.seed(1)
gender <- sample(c(0, 1), size = 100, replace = TRUE)
age <- round(runif(100, 18, 80))
xb <- -9 + 3.5 * gender + 0.2 * age       # true linear predictor
p <- 1 / (1 + exp(-xb))                   # inverse logit
y <- rbinom(n = 100, size = 1, prob = p)

# grabbing the coefficients from the logistic regression model
matrix_coef <- summary(glm(y ~ gender + age, family = "binomial"))$coefficients
the_estimates <- matrix_coef[, 1]         # first column holds the point estimates
the_estimates
the_estimates[1]
the_estimates[2]
the_estimates[3]
I just cannot seem to figure out how to have R create the formula with the variables (x's) and the coefficients from the original model in a flexible manner that accommodates any number of variables and different distributions. This is not a class assignment, but a necessary piece for the research that I am producing. Any help will be greatly appreciated, and please treat this as a teaching moment. I really want to learn this.
I'm not 100% sure what your question is here.
If you want to simulate new data from the same model with the same predictor variables, you can use the simulate() method:
dd <- data.frame(y, gender, age)
## best practice when modeling in R: take the variables from a data frame
model <- glm(y ~ gender + age, data = dd, family = "binomial")
simulate(model)
You can create multiple replicates by specifying the nsim= argument (or you can simulate anew every time through a for() loop).
If you want to simulate new data from a different set of predictor variables, you have to do a little bit more work (some model types in R have a newdata= argument, but not GLMs alas):
## simulate new model matrix (including intercept)
simdat <- cbind(1,
                gender = rbinom(100, prob = 0.5, size = 1),
                age = sample(18:80, size = 100, replace = TRUE))
## extract inverse-link function
invlink <- family(model)$linkinv
## sample new values
resp <- rbinom(n = 100, size = 1, prob = invlink(simdat %*% coef(model)))
If you want to do this later from coefficients that have been stored, substitute the retrieved coefficient vector for coef(model) in the code above.
If you want to flexibly construct formulas, reformulate() is your friend — but I don't see how it fits in here.
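For reference, a one-line illustration of reformulate(), with variable names taken from the question:
reformulate(c("gender", "age"), response = "y")   # builds the formula y ~ gender + age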
If you want to (say) re-fit the model 1000 times to new responses simulated from the original model fit (same coefficients, same predictors: i.e. a parametric bootstrap), you can do something like this.
nsim <- 1000
res <- matrix(NA, ncol = length(coef(model)), nrow = nsim)
for (i in 1:nsim) {
  ## simulate() returns a list (in this case, of length 1);
  ## extract the response vector
  newresp <- simulate(model)[[1]]
  newfit <- update(model, newresp ~ .)
  res[i, ] <- coef(newfit)
}
You don't have to store coefficients - you can extract/compute whatever model summaries you like (change the number of columns of res appropriately).
Let's say your data matrix, including age and gender (or whatever predictors), is X. Then you can use X on the right-hand side of your glm formula, compute xb_hat <- X %*% the_estimates (or use any other data matrix in place of X, as long as it has the same columns), and plug xb_hat into whatever link function you want.
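A minimal sketch of that recipe, reusing the names from the question (plogis() is R's built-in inverse-logit function):
X <- model.matrix(~ gender + age)     # design matrix, intercept column included
xb_hat <- X %*% the_estimates         # linear predictor from the saved estimates
p_hat <- plogis(xb_hat)               # apply the inverse link to get probabilities
y_sim <- rbinom(n = nrow(X), size = 1, prob = p_hat)   # simulated binary outcomes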

Calculate the indirect effect in 1-1-1 (within-person, multilevel) mediation analyses

I have data from an Experience Sampling Study, which consists of 8140 observations nested in 106 participants. I want to test for mediation, and I also want to compare the predictors (X1 = socialInteraction_tech, X2 = socialInteraction_ftf, M = MPEE_int, Y = wellbeing). X1, X2, and M are person-mean centred in order to obtain the within-person effects. To account for the autocorrelation I have fit a model with an ARMA(2,1) structure. We control for time with the variable "obs".
This is the final model including all variables of interest:
fit_mainH1xmy <- lme(fixed = wellbeing ~ 1 + obs                  # controls
                       + MPEE_int_centred
                       + socialInteraction_tech_centred
                       + socialInteraction_ftf_centred,
                     random = ~ 1 + obs | ID,
                     correlation = corARMA(form = ~ obs | ID, p = 2, q = 1),
                     data = file, method = "ML", na.action = na.exclude)
summary(fit_mainH1xmy)
The mediation is partial, as my predictor X still significantly predicts Y after adding M.
However, I can't find a way to calculate the indirect effect.
I have found the mlma package, but its interface is opaque to me and it requires transforming my data.
I have tried melting the data into long format and using lmer() to fit the model (following https://quantdev.ssri.psu.edu/sites/qdev/files/ILD_Ch07_2017_Within-PersonMedationWithMLM.html), but lmer() does not let me take into account the moving-average (MA) part of the ARMA(2,1) structure.
Does anyone know how I could now obtain the indirect effect?
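For orientation, in a 1-1-1 mediation the indirect effect is the product of the a path (X → M) and the b path (M → Y, controlling for X). A rough sketch with nlme, reusing the names above and assuming the same correlation structure is appropriate for the mediator model (a bootstrap or Monte Carlo interval would still be needed for inference):
library(nlme)
## a path: regress the mediator on the predictor of interest
fit_a <- lme(fixed = MPEE_int_centred ~ 1 + obs + socialInteraction_tech_centred,
             random = ~ 1 + obs | ID,
             correlation = corARMA(form = ~ obs | ID, p = 2, q = 1),
             data = file, method = "ML", na.action = na.exclude)
a <- fixef(fit_a)["socialInteraction_tech_centred"]
## b path: taken from the full model fit_mainH1xmy above
b <- fixef(fit_mainH1xmy)["MPEE_int_centred"]
indirect <- a * b   # point estimate of the within-person indirect effect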

Log-binomial regression for a binary outcome with multiple categorical and numeric predictors

I'm trying to get the relative risk (RR) from a log-binomial regression with a binary outcome. There are two categorical variables, treatment and group, and two numeric variables, age and BMI.
But I get an error:
Error: cannot find valid starting values: please specify some
How can I fix this error?
N <- 50
data.1 <- data.frame(Outcome = sample(c(0, 0, 1), N, replace = TRUE),
                     Age = runif(N, 8, 58),
                     BMI = rnorm(N, 25, 6),
                     Group = rep(c(0, 1), length.out = N),
                     treatment = rep(c('1', '2', '3'), length.out = N))
data.1$Group <- as.factor(data.1$Group)
coefini <- exp(coef(glm(Outcome ~ Group + treatment + Age + BMI, data = data.1,
                        family = binomial(link = "logit"))))
fit2 <- glm(Outcome ~ Group + treatment + Age + BMI, data = data.1,
            family = binomial(link = "log"), start = coefini)
This seems to be because the coefficients from the logistic regression don't work as starting values for the log-binomial regression. Replace the coefini line with
coefini <- coef(glm(Outcome ~ Group + treatment + Age + BMI, data = data.1,
                    family = binomial(link = "log")))
and it works. (Remove the exp() and change the link to log.)
Logistic-regression coefficients can produce a positive linear combination, which under the log link corresponds to P(success) > 1. It looks like R returns this error when that happens, and finding starting values that avoid the problem can be challenging.
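A tiny illustration of that failure mode (the number is made up):
xb <- 0.5      # a positive linear predictor
plogis(xb)     # 0.62 under the logit link: a valid probability
exp(xb)        # 1.65 under the log link: a "probability" greater than 1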
A question/answer in another community was useful to me in computing starting values for log-binomial regression.
That question is about starting values for LOGISTIC regression, and its answer is that you can get them with linear regression "by regressing the logit of the response, y, on the predictors with weight ny(1-y)".
In that question, the binary (1/0) dependent variable is REMISS, and the answer provides these steps to obtain starting values for a logistic regression:
y = .1*(remiss=0) + .9*(remiss=1)
logit = log(y/(1-y))
wt = y*(1-y)
Then the starting values come from the weighted linear regression of logit on the predictors of interest.
But I adjusted a couple of things for link="log" instead of "logit": Instead of
logit = log(y/(1-y))
wt = y*(1-y)
I used
depvar = log(y)
wt = y/(1-y)
Then the starting values for the log-binomial model are the results of the weighted linear regression of depvar on the same predictors.
I am more into Python than R these days, but I believe with your example that means the following:
data.1$y <- 0.1 * (data.1$Outcome == 0) + 0.9 * (data.1$Outcome == 1)
data.1$depvar <- log(data.1$y)
data.1$wt <- data.1$y / (1 - data.1$y)
coefini <- coef(glm(depvar ~ Group + treatment + Age + BMI,
                    weights = wt, data = data.1))
fit2 <- glm(Outcome ~ Group + treatment + Age + BMI, data = data.1,
            family = binomial(link = "log"), start = coefini)

Fit a binomial GLM on probabilities (i.e. using logistic regression for regression, not classification)

I want to use a logistic regression to actually perform regression and not classification.
My response variable is numeric between 0 and 1 and not categorical. This response variable is not related to any kind of binomial process. In particular, there is no "success", no "number of trials", etc. It is simply a real variable taking values between 0 and 1 depending on circumstances.
Here is a minimal example to illustrate what I want to achieve:
dummy_data <- data.frame(a = 1:10,
                         b = factor(letters[1:10]),
                         resp = runif(10))
fit <- glm(formula = resp ~ a + b,
           family = "binomial",
           data = dummy_data)
This code gives a warning, then fails, because I am trying to fit the "wrong kind" of data:
In eval(family$initialize) : non-integer #successes in a binomial glm!
Yet I think there must be a way since the help of family says:
For the binomial and quasibinomial families the response can be
specified in one of three ways: [...] (2) As a numerical vector with
values between 0 and 1, interpreted as the proportion of successful
cases (with the total number of cases given by the weights).
Somehow the same code works using "quasibinomial" as the family, which makes me think there may be a way to make it work with a binomial glm.
I understand the likelihood is derived under the assumption that $y_i \in \{0, 1\}$ but, looking at the maths, it seems like the log-likelihood still makes sense for $y_i \in [0, 1]$. Am I wrong?
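(Concretely, the per-observation log-likelihood is $\ell_i = y_i \log p_i + (1 - y_i) \log(1 - p_i)$, which is well defined for any $y_i \in [0, 1]$.)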
This is because you are using the binomial family while supplying the wrong kind of response. With the binomial family, the outcome has to be either 0 or 1, not a probability value.
This code works fine, because the response is either 0 or 1.
dummy_data <- data.frame(a = 1:10,
                         b = factor(letters[1:10]),
                         resp = sample(c(0, 1), 10, replace = TRUE, prob = c(.5, .5)))
fit <- glm(formula = resp ~ a + b,
           family = binomial(),
           data = dummy_data)
If you want to model the probability directly, you should include an additional column with the total number of cases. The probability you want to model is then interpreted as the success rate given the number of cases in the weights column.
dummy_data <- data.frame(a = 1:10,
                         b = factor(letters[1:10]),
                         resp = runif(10),
                         w = round(runif(10, 1, 11)))
fit <- glm(formula = resp ~ a + b,
           family = binomial(),
           data = dummy_data, weights = w)
You will still get the warning message, but you can ignore it, given these conditions:
resp is the proportion of 1's in n trials, and
for each value in resp, the corresponding value in w is the number of trials.
From the discussion at "Warning: non-integer #successes in a binomial glm! (survey packages)", I think we can also solve it with another family function, quasibinomial():
dummy_data <- data.frame(a = 1:10,
                         b = factor(letters[1:10]),
                         resp = runif(10),
                         w = round(runif(10, 1, 11)))
fit2 <- glm(formula = resp ~ a + b,
            family = quasibinomial(),
            data = dummy_data, weights = w)

Error when trying to run fixed effects logistic regression

Not sure where I can get help, since this exact post was considered off-topic on Stack Exchange.
I want to run some regressions based on a balanced panel with electoral data from Brazil focusing on 2 time periods. I want to understand if after a change in legislation that prohibited firm donations to candidates, those individuals that depended most on these resources had a lower probability of getting elected.
I have already run a regression like this in R:
model_continuous <- plm(percentage_of_votes ~ time + treatment + time * treatment,
                        data = dataset, model = 'fd')
In this model I used a continuous variable (% of votes) as my dependent variable. My treatment units are those that at time = 0 had no campaign contributions coming from corporations.
Now I want to change my dependent variable so that it is a binary variable indicating if the candidate was elected on that year. All of my units were elected on time = 0. How can I estimate a logit or probit model using fixed effects? I have tried using the pglm package in R.
model_binary <- pglm(dummy_elected ~ time + treatment + time * treatment,
                     data = dataset,
                     effects = 'twoways',
                     model = 'within',
                     family = 'binomial',
                     start = NULL)
However, I got this error:
Error in maxRoutine(fn = logLik, grad = grad, hess = hess, start = start, :
argument "start" is missing, with no default
Why is that happening? What is wrong with my model? Is it conceptually correct?
I want the second regression to be as similar as possible to the first one.
I have read that the clogit() function from the survival package could do the job, but I don't know how to use it.
Edit:
this is what a sample dataset could look like:
dataset <- data.frame(individual = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5),
                      time = c(0, 1, 0, 1, 0, 1, 0, 1, 0, 1),
                      treatment = c(0, 0, 1, 1, 0, 0, 1, 1, 0, 0),
                      corporate = c(0, 0, 0.1, 0, 0, 0, 0.5, 0, 0, 0))
Based on the comments, I believe the logistic regression reduces to treatment and dummy_elected. Accordingly, I have fabricated the following dataset:
dataset <- data.frame(treatment = c(rep(1, 1000), rep(0, 1000)),
                      dummy_elected = c(rep(1, 700), rep(0, 300), rep(1, 500), rep(0, 500)))
I then ran the GLM model:
library(MASS)
model_binary <- glm(dummy_elected ~ treatment, family = binomial(), data = dataset)
summary(model_binary)
Note that the treatment coefficient is significant and the coefficients are given. The resulting probabilities are thus
Probability(dummy_elected = 1) = 1 / (1 + exp(-(1.3767e-16 + 0.84730 * treatment)))
Probability(dummy_elected = 0) = 1 - 1 / (1 + exp(-(1.3767e-16 + 0.84730 * treatment)))
Note that these probabilities are consistent with the frequencies with which I generated the data: 0.7 when treatment = 1 and 0.5 when treatment = 0.
So for each row, take the larger of the two probabilities above; that gives the predicted value of dummy_elected.
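As a quick numerical check of that consistency (plogis() is R's inverse logit):
plogis(1.3767e-16 + 0.84730 * 1)   # 0.70 for treatment = 1
plogis(1.3767e-16 + 0.84730 * 0)   # 0.50 for treatment = 0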
