Using generalized estimating equations to model outcomes with a population offset - r

I'm trying to use GEE to model counts of an outcome with a population offset. I have models with interaction terms and am trying to use the effects package to summarize parameter estimates and odds ratios (ORs).
When I compute ORs by hand, the results don't match the output I get from the effects::allEffects() function, and I'm not sure why. The data can't be shared but the model is
mdl <- geeglm(count~age+gender+age:gender+offset(log(totalpop)),
family="poisson", corstr="exchangeable", id=geo,
waves=year, data=df)
I use the code below to compute these by hand. log_OR sums the relevant coefficients (including the interaction terms) without the intercept; log_odds sums the same coefficients with the intercept added. The code is taken from here.
tibble(
variables = names(coef(mdl)),
log_OR = c(...),
log_odds = c(...),
OR = exp(log_OR),
odds = exp(log_odds),
probability = odds / (1 + odds)
) %>%
mutate_if(is.numeric, ~round(., 5)) %>%
knitr::kable()
I then compare my manual calculations to the output of allEffects below. They don't match. Can someone help me see what I am doing wrong?
result <- allEffects(mdl)
allEffects(mdl) %>% summary()
variable <- result[["age:gender"]][["x"]]
Prob <- result$`age:gender`$fit
Prob_upper <- result$`age:gender`$upper
Prob_lower <- result$`age:gender`$lower
model_Est <- data.frame("Est"=Prob, "CI Lower"= Prob_lower,
"CI Upper"= Prob_upper)
model_OR <- exp(model_Est) # back-transform from the link (log) scale
model_Est <- data.frame("Variable"=variable, model_Est)
model_OR <- data.frame("Variable"=variable, model_OR)

You haven't given us very much to go on, but the cause is almost certainly that the offset isn't being dealt with properly. (The first thing I would try is running the model without the offset to see if the results from effects and your by-hand calculations match: that's not the model you want, but it will confirm that the problem is with the offsets.)
?effects says:
offset a function to be applied to the offset values (if
there is an offset) in a linear or generalized linear
model, or a mixed-effects model fit by ‘lmer’ or ‘glmer’;
or a numeric value, to which the offset will be set. The
default is the ‘mean’ function, and thus the offset will
be set to its mean; in the case of ‘"svyglm"’ objects,
the default is to use the survey-design weighted mean.
Note: Only offsets defined by the ‘offset’ argument to
‘lm’, ‘glm’, ‘svyglm’, ‘lmer’, or ‘glmer’ will be handled
correctly; use of the ‘offset’ function in the model
formula is not supported.
(emphasis added)
methods("effects") lists only effects.glm and effects.lm, which suggests that the model is being treated as a glm (i.e., there is no specialized method for GEE models). So, this suggests:
(1) you need to include offset= as a separate argument in your model.
(2) when doing your hand calculation, make sure the value of the offset is set to the mean value across all observations (unless you choose to use the offset= argument to effects/allEffects to change the default summary function).
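To make point (2) concrete, here is a minimal sketch of a by-hand calculation with the offset fixed at its mean. It is only an illustration under assumptions: it uses a plain glm on simulated data as a stand-in (your data can't be shared, and I haven't checked that geeglm behaves identically), and the names sim, grid, eta_hand and eta_pred are made up for the example.

```r
# Simulated stand-in data (hypothetical; not the asker's data)
set.seed(1)
n <- 200
sim <- data.frame(
  age      = factor(sample(c("young", "old"), n, replace = TRUE)),
  gender   = factor(sample(c("f", "m"), n, replace = TRUE)),
  totalpop = round(runif(n, 500, 5000))
)
sim$count <- rpois(n, lambda = sim$totalpop * 0.01)

# Offset passed as a separate argument rather than inside the formula
fit <- glm(count ~ age * gender, family = poisson,
           offset = log(totalpop), data = sim)

# By hand: linear predictor for every age/gender cell,
# with the offset set to its mean over all observations
mean_off <- mean(log(sim$totalpop))
grid <- expand.grid(age = levels(sim$age), gender = levels(sim$gender))
X <- model.matrix(~ age * gender, data = grid)
eta_hand  <- drop(X %*% coef(fit)) + mean_off
rate_hand <- exp(eta_hand)  # expected count at the "average" offset

# Cross-check with predict(), supplying a population whose log is the mean offset
grid$totalpop <- exp(mean_off)
eta_pred <- predict(fit, newdata = grid, type = "link")
```

If the hand calculation leaves out mean_off (or uses some other reference population), every cell shifts by a constant on the log scale, which is the kind of mismatch described in the question.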

Related

Back transformation of emmeans in lmer

I had to transform a response variable (e.g. Variable1) to fulfil the assumptions of linear models in lmer using an approach suggested here https://www.r-bloggers.com/2020/01/a-guide-to-data-transformation/ for heavy-tailed data and demonstrated below:
TransformVariable1 <- sqrt(abs(Variable1 - median(Variable1)))
I then fit the data to the following example model:
fit <- lmer(TransformVariable1 ~ x + y + (1|z), data = dataframe)
Next, I update the reference grid to account for the transformation, as suggested in "Specifying that model is logit transformed to plot backtransformed trends":
rg <- update(ref_grid(fit), tran = "TransformVariable1")
Nevertheless, the emmeans are not back-transformed to the original scale after using the following command:
fitemm <- as.data.frame(emmeans(rg, ~ x + y, type = "response"))
My question is: How can I back transform the emmeans to the original scale?
Thank you in advance.
There are two major problems here.
The lesser of them is in specifying tran. You need to either specify one of a handful of known transformations, such as "log", or a list with the needed functions to undo the transformation and implement the delta method. See the help for make.link, make.tran, and vignette("transformations", "emmeans").
The much more serious issue is that the transformation used here is not a monotone function, so it is impossible to back-transform the results. Each transformed response value corresponds to two possible values on either side of the median of the original variable. The model we have here does not estimate effects on the given variable, but rather effects on the dispersion of that variable. It's like trying to use the speedometer as a substitute for a navigation system.
I would suggest using a different model, or at least a different response variable.
A possible remedy
Looking again at this, I wonder if what was meant was the symmetric square-root transformation -- what is shown multiplied by sign(Variable1 - median(Variable1)). This transformation is available in emmeans::make.tran(). You will need to re-fit the model.
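A small numeric illustration of the difference, using only base R. The vector v is made up for the example; folded is the transformation from the question and signed is the symmetric square root described above.

```r
v <- c(2, 4, 5, 6, 8)   # hypothetical response values
m <- median(v)          # 5

# Transformation from the question: values on either side of the median collapse
folded <- sqrt(abs(v - m))
folded[2] == folded[4]  # 4 and 6 give the same transformed value

# Symmetric (signed) square root: strictly increasing, so it can be undone
signed <- sign(v - m) * sqrt(abs(v - m))
back   <- m + sign(signed) * signed^2
```

Because folded maps 4 and 6 to the same number, no function can recover the original value from it, whereas back reproduces v from signed.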
What I suggest is creating the transformation object first, then using it throughout:
require(lme4)
require(emmeans)
symsqrt <- make.tran("sympower", param = c(0.5, median(Variable1)))
fit <- with(symsqrt,
lmer(linkfun(Variable1) ~ x + y + (1|z), data = dataframe)
)
emmeans(fit, ~ x + y, type = "response")
symsqrt comprises a list of functions needed to implement the transformation. The transformation itself is symsqrt$linkfun, and the emmeans package knows to look for the other stuff when the response transformation is named linkfun.
BTW, please break the habit of wrapping emmeans() in as.data.frame(). That renders invisible some important annotations, and also disables the possibility of following up with contrasts and comparisons. If you think you want to see more precision than is shown, you can precede the call with emm_options(opt.digits = FALSE); but really, you are kidding yourself if you think those extra digits give you useful information.

glm summary not giving coefficients values

I'm trying to fit a glm to a given dataset, but summary(model1) is not giving me the correct output: it doesn't give coefficient values for Estimate, Std. Error, z value, Pr(>|z|), etc., it just gives NA for each individual attribute.
TEXT <- c('Learned a new concept today : metamorphic testing. t.co/0is1IUs3aW','BMC Bioinformatics BioMed Central: Detecting novel ncRNAs by experimental #RNomics is not an easy task... http:/t.co/ui3Unxpx #bing #MyEN','BMC Bioinformatics BioMed Central: small #RNA with a regulatory function as a scientific ... Detecting novel… http:/t.co/wWHOEkR0vc #bing','True or false? link(#Addition, #Classification) http:/t.co/zMJuTFt8iq #Oxytocin','Biologists do have a sense of humor, especially computational bio people http:/t.co/wFZqaaFy')
NAME <- c('QSoft Consulting','Fabrice Leclerc','Sungsam Gong','Frederic','Zach Stednick')
SCREEN_NAME <-c ('QSoftConsulting','rnomics','sunggong','rnomics','jdwasmuth')
FOLLOWERS_COUNT <- c(734,1900,234,266,788)
RETWEET <- c(1,3,5,0,2)
FRIENDS_COUNT <-c(34,532,77,213,422)
STATUSES_COUNT <- c(234,643,899,222,226)
FAVOURITES_COUNT <- c(144,2677,445,930,254)
df <- data.frame(TEXT,NAME,SCREEN_NAME,RETWEET,FRIENDS_COUNT,STATUSES_COUNT,FAVOURITES_COUNT)
mydata<-df
mydata$FAVOURITES_COUNT <- ifelse( mydata$FAVOURITES_COUNT >= 445, 1, 0) #converting fav_count to binary values
Splitting data
library(caret)
split=0.60
trainIndex <- createDataPartition(mydata$FAVOURITES_COUNT, p=split, list=FALSE)
data_train <- mydata[ trainIndex,]
data_test <- mydata[-trainIndex,]
glm model
library(e1071)
model1 <- glm(FAVOURITES_COUNT~.,family = binomial, data = data_train)
summary(model1)
I want to get the p-values for further analysis. So far I think my code is right; how can I get the correct output?
A binomial distribution will only work if the dependent variable has two outcomes. You should consider a Poisson distribution when the dependent variable is a count. See here for more details: http://www.statmethods.net/advstats/glm.html
Your code for fitting the GLM is programmatically correct. However, there are a few issues:
As mentioned in the comments, for every variable that is categorical, you should use as.factor() to make it into a factor. GLM doesn't know what a "string" variable is.
As MorganBall indicated, if your data truly is count data, you may consider fitting it using a Poisson GLM, instead of converting to binary and using Logistic regression.
You indicate that you have 13 parameters and 1000 observations. While this may seem like enough data, note that some of these parameters may have very few (close to 0?) observations in them. This is a problem.
In addition, did you make sure that your data does not perfectly separate the response? Because if there are some combinations of parameters that do separate the response perfectly, the maximum likelihood estimate won't converge and theoretically goes to infinity. Practically speaking, you'll get very large standard errors for your estimates.
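To see what perfect separation looks like in practice, here is a minimal sketch with made-up toy data (x, y_sep and y_ok are not from the question). In y_sep every x above 4 is a 1 and every x below is a 0; y_ok has some overlap.

```r
x     <- 1:8
y_sep <- c(0, 0, 0, 0, 1, 1, 1, 1)  # perfectly separated by x
y_ok  <- c(0, 0, 1, 0, 1, 0, 1, 1)  # overlapping, so the MLE exists

# glm() warns about fitted probabilities of 0 or 1 in the separated case;
# the warnings are suppressed here only to keep the output short
fit_sep <- suppressWarnings(glm(y_sep ~ x, family = binomial))
fit_ok  <- glm(y_ok ~ x, family = binomial)

se_sep <- summary(fit_sep)$coefficients[, "Std. Error"]
se_ok  <- summary(fit_ok)$coefficients[, "Std. Error"]
se_sep["x"]  # enormous
se_ok["x"]   # moderate
```

With five rows and several character columns, as in the question, it is worth checking for this (and for having more coefficients than observations) before trusting any p-value.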

wrapnls: Error: singular gradient matrix at initial parameter estimates

I have created a loop to fit a non-linear model to six data points by participants (each participant has 6 data points). The first model is a one parameter model. Here is the code for that model that works great. The time variable is defined. The participant variable is the id variable. The data is in long form (one row for each datapoint of each participant).
Here is the loop code with 1 parameter that works:
model_1p <- dlply(discounting_long, .(Participant), function(discounting_long) {wrapnls(indiff ~ 1/(1+k*time), data = discounting_long, start = c(k=0))})
However, when I try to fit a two parameter model, I get this error "Error: singular gradient matrix at initial parameter estimates" while still using the wrapnls function. I realize that the model is likely over parameterized, that is why I am trying to use wrapnls instead of just nls (or nlsList). Some in my field insist on seeing both model fits. I thought that the wrapnls model avoids the problem of 0 or near-0 residuals. Here is my code that does not work. The start values and limits are standard in the field for this model.
model_2p <- dlply(discounting_long, .(Participant), function(discounting_long) {nlxb(indiff ~ 1/(1+k*time^s), data = discounting_long, lower = c (s = 0), start = c(k=0, s=.99), upper = c(s=1))})
I realize that I could use nlxb (which does give me the correct parameter values for each participant) but that function does not give predictive values or residuals of each data point (at least I don't think it does) which I would like to compute AIC values.
I am also open to other solutions for running a loop through the data by participants.
You mention at the end that 'nlxb doesn't give you residuals', but it does. If the result of your call to nlxb is called fit, then the residuals are in fit$resid, so you can get the fitted values just by adding them to the original data. Honestly I don't know why nlxb hasn't been made to work with the predict() function, but at least there's a way to get the predicted values.
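Once you have a residual vector, the AIC can be computed by hand from the Gaussian log-likelihood. The sketch below is only an illustration: it uses nls on made-up data for one hypothetical participant so that it runs with base R and can be checked against AIC(); with nlxb you would plug in fit$resid for res and the number of estimated parameters for p.

```r
# Made-up data for one hypothetical participant
time   <- c(1, 7, 30, 90, 180, 365)
indiff <- 1 / (1 + 0.02 * time) + c(0.02, -0.03, 0.01, 0.02, -0.01, -0.02)

# Stand-in fit (nls here; the residual vector is all that is needed below)
fit_nls <- nls(indiff ~ 1 / (1 + k * time), start = list(k = 0.01))

res <- residuals(fit_nls)  # with nlxb: fit$resid
n   <- length(res)
p   <- 1                   # number of model parameters (k)
rss <- sum(res^2)

# Gaussian log-likelihood at the ML estimate of the error variance (rss / n)
loglik   <- -n / 2 * (log(2 * pi) + log(rss / n) + 1)
aic_hand <- -2 * loglik + 2 * (p + 1)  # +1 for the error variance
```

Only the squared residuals enter the formula, so the sign convention of the residuals does not matter for the AIC.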

plotting glm interactions: "newdata=" structure in predict() function

My problem is with the predict() function, its structure, and plotting the predictions.
Using the predictions coming from my model, I would like to visualize how my significant factors (and their interaction) affect the probability of my response variable.
My model:
m1 <-glm ( mating ~ behv * pop +
I(behv^2) * pop + condition,
data=data1, family=binomial(logit))
mating: individual has mated or not (factor, binomial: 0,1)
pop: population (factor, 4 levels)
behv: behaviour (numeric, scaled & centered)
condition: relative fat content (numeric, scaled & centered)
Significant effects after running the glm:
pop1
condition
behv*pop2
behv^2*pop1
Although I have read the help pages, previous answers to similar questions, tutorials etc., I couldn't figure out how to structure the newdata= part in the predict() function. The effects I want to visualise (given above) might give a clue of what I want: For the "behv*pop2" interaction, for example, I would like to get a graph that shows how the behaviour of individuals from population-2 can influence whether they will mate or not (probability from 0 to 1).
Really the only thing that predict expects is that the names of the columns in newdata exactly match the column names used in the formula. And you must have values for each of your predictors. Here's some sample data.
#sample data
set.seed(16)
data <- data.frame(
mating=sample(0:1, 200, replace=T),
pop=sample(letters[1:4], 200, replace=T),
behv = scale(rpois(200,10)),
condition = scale(rnorm(200,5))
)
data1<-data[1:150,] #for model fitting
data2<-data[51:200,-1] #for predicting
Then this will fit the model using data1 and predict into data2
model<-glm ( mating ~ behv * pop +
I(behv^2) * pop + condition,
data=data1,
family=binomial(logit))
predict(model, newdata=data2, type="response")
Using type="response" will give you the predicted probabilities.
Now to make predictions, you don't have to use a subset from the exact same data.frame. You can create a new one to investigate a particular range of values (just make sure the column names match up). So in order to explore behv*pop2 (or behv*popb in my sample data), I might create a data.frame like this
popbbehv<-data.frame(
pop="b",
behv=seq(from=min(data$behv), to=max(data$behv), length.out=100),
condition = mean(data$condition)
)
Here I fix pop="b" so I'm only looking at that population, and since I have to supply condition as well, I fix that at the mean of the original data. (I could have just put in 0 since the data is centered and scaled.) Now I specify a range of behv values I'm interested in. Here I just took the range of the original data and split it into 100 evenly spaced points. This will give me enough points to plot. So again I use predict to get
popbbehvpred<-predict(model, newdata=popbbehv, type="response")
and then I can plot that with
plot(popbbehvpred~behv, popbbehv, type="l")
Although nothing is significant in my fake data, we can see that higher behavior values seem to result in less mating for population B.
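If you want the same curve for every population at once, one option is to build newdata with expand.grid. This is just a sketch on the same kind of fake data; dat, fit and grid are names made up for the example, and condition is held at 0 on the assumption that it is centered.

```r
# Fake data, same structure as above
set.seed(16)
dat <- data.frame(
  mating    = sample(0:1, 200, replace = TRUE),
  pop       = factor(sample(letters[1:4], 200, replace = TRUE)),
  behv      = as.numeric(scale(rpois(200, 10))),
  condition = as.numeric(scale(rnorm(200, 5)))
)
fit <- glm(mating ~ behv * pop + I(behv^2) * pop + condition,
           data = dat, family = binomial(logit))

# One row per combination of population and behv value;
# column names must match the names used in the formula
grid <- expand.grid(
  pop       = levels(dat$pop),
  behv      = seq(min(dat$behv), max(dat$behv), length.out = 50),
  condition = 0
)
grid$prob <- predict(fit, newdata = grid, type = "response")
```

You can then draw one line per population, for example by looping over split(grid, grid$pop) and calling lines() on each piece.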

Predict.lm() in R - how to get nonconstant prediction bands around fitted values

So I am currently trying to draw the confidence interval for a linear model. I found out I should use predict.lm() for this, but I have a few problems really understanding the function and I do not like using functions without knowing what's happening. I found several how-to's on this subject, but only with the corresponding R-code, no real explanation.
This is the function itself:
## S3 method for class 'lm'
predict(object, newdata, se.fit = FALSE, scale = NULL, df = Inf,
interval = c("none", "confidence", "prediction"),
level = 0.95, type = c("response", "terms"),
terms = NULL, na.action = na.pass,
pred.var = res.var/weights, weights = 1, ...)
Now, what I've trouble understanding:
1) newdata
An optional data frame in which to look for variables
with which to predict. If omitted, the fitted values are used.
Everyone seems to use newdata for this, but I cannot quite understand why. To calculate the confidence interval I obviously need the data the interval is for (the number of observations, the mean of x, etc.), so that cannot be what newdata means. But then what does it mean?
2) interval
Type of interval calculation.
okay.. but what is "none" for?
3a) type
Type of prediction (response or model term).
3b) terms
If type="terms", which terms (default is all terms)
3a: Can I by that get the confidence interval for one specific variable in my model? And if so, what is 3b for then? If I can specify the term in 3a, it wouldn't make sense to do it in 3b again.. so I guess I'm wrong again, but I cannot figure out why.
I guess some of you might think: why not just try this out? And I would (even if it might not solve everything here), but right now I don't know how to. As I do not know what newdata is for, I don't know how to use it, and if I try, I do not get the right confidence interval. Somehow it is very important how you choose that data, but I just don't understand!
EDIT: I want to add that my intention is to understand how predict.lm works; I am not sure it works the way I think it does. That is, does it calculate y-hat (the predicted values) and then add/subtract something to each one to get the upr/lwr bounds of the interval, giving several data points that then look like a confidence line? If so, I would understand why newdata needs to have the same length as the data used in the linear model.
Make up some data:
d <- data.frame(x=c(1,4,5,7),
y=c(0.8,4.2,4.7,8))
Fit the model:
lm1 <- lm(y~x,data=d)
Confidence and prediction intervals with the original x values:
p_conf1 <- predict(lm1,interval="confidence")
p_pred1 <- predict(lm1,interval="prediction")
Conf. and pred. intervals with new x values (extrapolation and more finely/evenly spaced than original data):
nd <- data.frame(x=seq(0,8,length=51))
p_conf2 <- predict(lm1,interval="confidence",newdata=nd)
p_pred2 <- predict(lm1,interval="prediction",newdata=nd)
Plotting everything together:
par(las=1,bty="l") ## cosmetics
plot(y~x,data=d,ylim=c(-5,12),xlim=c(0,8)) ## data
abline(lm1) ## fit
matlines(d$x,p_conf1[,c("lwr","upr")],col=2,lty=1,type="b",pch="+")
matlines(d$x,p_pred1[,c("lwr","upr")],col=2,lty=2,type="b",pch=1)
matlines(nd$x,p_conf2[,c("lwr","upr")],col=4,lty=1,type="b",pch="+")
matlines(nd$x,p_pred2[,c("lwr","upr")],col=4,lty=2,type="b",pch=1)
Using new data allows for extrapolation beyond the original data; also, if the original data are sparsely or unevenly spaced, the prediction intervals (which are not straight lines) may not be well approximated by linear interpolation between the original x values ...
I'm not quite sure what you mean by the "confidence interval for one specific variable in my model"; if you want confidence intervals on a parameter, then you should use confint. If you want predictions for the changes based only on some of the parameters changing (ignoring the uncertainty due to the other parameters), then you do indeed want to use type="terms".
interval="none" (the default) just tells R not to bother computing any confidence or prediction intervals, and to return just the predicted values.
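As for what happens under the hood: for each row of newdata, predict.lm computes the fitted value and its standard error, then adds and subtracts a t quantile times that standard error. A minimal sketch that reproduces both interval types by hand, reusing the same toy data (nd5, conf_hand and pred_hand are names made up for the example):

```r
d   <- data.frame(x = c(1, 4, 5, 7), y = c(0.8, 4.2, 4.7, 8))
lm1 <- lm(y ~ x, data = d)
nd5 <- data.frame(x = seq(0, 8, length.out = 5))  # any x values you like

p     <- predict(lm1, newdata = nd5, se.fit = TRUE)
tcrit <- qt(0.975, df = p$df)  # 95% level, residual degrees of freedom

# Confidence interval: uncertainty in the fitted mean only
conf_hand <- cbind(fit = p$fit,
                   lwr = p$fit - tcrit * p$se.fit,
                   upr = p$fit + tcrit * p$se.fit)

# Prediction interval: add the residual variance for a new observation
pred_se   <- sqrt(p$se.fit^2 + p$residual.scale^2)
pred_hand <- cbind(fit = p$fit,
                   lwr = p$fit - tcrit * pred_se,
                   upr = p$fit + tcrit * pred_se)
```

The standard error depends on how far each x is from the mean of the original x values, which is why the bands are curved, and newdata can have any number of rows because each row is handled separately.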

Resources