I'm trying to fit a generalized additive model with a binary response, using the code:
library(mgcv)
m = gam(y~s(x1)+s(x2), family=multinom(K=2), data=mydata)
Below is part of my data (the total sample size is 443):
mydata[1:3, ]
  y       x1        x2
1 1 12.55127 0.2553079
2 1 12.52029 0.2264185
3 0 12.53868 0.2183521
But I receive this error:
Error in offset[[i]] : attempt to select less than one element
What is wrong with my code?
First of all, for a binary response, why not use family = binomial()?
Secondly, if you want to try multinom, set K = 1, because categories are coded from 0 to K; see ?multinom. Note, however, that the multinom family requires a list of model formulae. Even with K = 1 you need a length-1 list: use list(y ~ s(x1) + s(x2)).
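For example (a minimal sketch using the variable names from the question):

library(mgcv)

# binary response: a logistic GAM
m <- gam(y ~ s(x1) + s(x2), family = binomial(), data = mydata)

# or, with the multinom family, pass a length-1 list of formulae and set K = 1
m2 <- gam(list(y ~ s(x1) + s(x2)), family = multinom(K = 1), data = mydata)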
I am using the phylolm function (in package phylolm) to conduct PGLS phylogenetic analysis and am having some trouble interpreting the model output.
I am running a phylolm model with a continuous (log-transformed) response variable and one predictor, a factor with two groups. When I change the reference group (from condition A to condition B) and rerun the same model, the estimates change accordingly but the standard errors do not seem to. The standard error for the new reference group remains very high, high enough that I don't see how the difference between groups can be significant (which the p-value indicates it is). I was under the impression that phylolm standard errors can be interpreted in the same way as for ordinary linear regression. Am I mistaken?
Since you have a binary variable, changing the reference category should only flip the sign of the beta estimate, should it not? This is what happens in your models.
It might help to think of the coefficient for condition as the difference between the group means. The difference between the means will be the same size, but the sign will change depending on whether you are comparing condition A to condition B or condition B to condition A.
set.seed(42)  # for reproducibility
n = 100       # sample size (n was undefined in the original snippet)
# Create a binary variable
x = sample(c(0, 1), n, replace = TRUE)
# and create the opposite variable (i.e. changing the reference level)
x.rev = +(!x)
# add some error to the model
error = runif(n)
# create the continuous response variable
y = 2 + 2 * x + error
# look at the group means of each condition
group_means = tapply(y, x, mean)
group_means[1] - group_means[2]
group_means[2] - group_means[1]
# compare those differences to the coefficients in the two models
summary(lm(y ~ x))
summary(lm(y ~ x.rev))
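Note that in the two summaries the slope has the same magnitude and the same standard error; only its sign (and the intercept) changes when the reference level is flipped. The same reasoning applies to your phylolm fits: for a two-level factor, unchanged standard errors after swapping the reference group are exactly what you should expect.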
I want to use a logistic regression to actually perform regression and not classification.
My response variable is numeric, taking values between 0 and 1, and it is not categorical. It is not related to any kind of binomial process: in particular, there is no "success" and no "number of trials". It is simply a real variable that takes values between 0 and 1 depending on circumstances.
Here is a minimal example to illustrate what I want to achieve:
dummy_data <- data.frame(a = 1:10,
                         b = factor(letters[1:10]),
                         resp = runif(10))
fit <- glm(formula = resp ~ a + b,
           family = "binomial",
           data = dummy_data)
This code gives a warning and then fails because I am trying to fit the "wrong kind" of data:
In eval(family$initialize) : non-integer #successes in a binomial glm!
Yet I think there must be a way, since the help page for family says:
For the binomial and quasibinomial families the response can be specified in one of three ways: [...] (2) As a numerical vector with values between 0 and 1, interpreted as the proportion of successful cases (with the total number of cases given by the weights).
Somehow the same code works using "quasibinomial" as the family, which makes me think there may be a way to make it work with a binomial glm.
I understand the likelihood is derived under the assumption that $y_i \in \{0, 1\}$ but, looking at the maths, it seems like the log-likelihood still makes sense for $y_i \in [0, 1]$. Am I wrong?
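Concretely, the Bernoulli log-likelihood is $\ell(\beta) = \sum_i \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$, where $p_i$ is the fitted probability for observation $i$; this expression is well defined for any $y_i \in [0, 1]$.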
This is because you are using the binomial family but supplying the wrong kind of response: with family = binomial() and no weights, the outcome is expected to be 0 or 1, not a probability.
This code works fine, because the response is either 0 or 1.
dummy_data <- data.frame(a = 1:10,
                         b = factor(letters[1:10]),
                         resp = sample(c(0, 1), 10, replace = TRUE, prob = c(.5, .5)))
fit <- glm(formula = resp ~ a + b,
           family = binomial(),
           data = dummy_data)
If you want to model the probability directly, you should include an additional column giving the total number of cases. The probability you want to model is then interpreted as the success rate, given the number of cases in the weights column.
dummy_data <- data.frame(a = 1:10,
                         b = factor(letters[1:10]),
                         resp = runif(10),
                         w = round(runif(10, 1, 11)))
fit <- glm(formula = resp ~ a + b,
           family = binomial(),
           data = dummy_data, weights = w)
You will still get the warning message, but you can ignore it, given these conditions:
resp is the proportion of 1's in n trials.
for each value in resp, the corresponding value in w is the number of trials.
From the discussion at Warning: non-integer #successes in a binomial glm! (survey packages), I think this can also be solved with another family function, quasibinomial().
dummy_data <- data.frame(a = 1:10,
                         b = factor(letters[1:10]),
                         resp = runif(10),
                         w = round(runif(10, 1, 11)))
fit2 <- glm(formula = resp ~ a + b,
            family = quasibinomial(),
            data = dummy_data, weights = w)
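As a quick sanity check (a sketch that assumes both families are fitted to the same dummy_data and weights w, rather than regenerating the data), the two families give identical point estimates; quasi-likelihood uses the same estimating equations and only changes the dispersion estimate, and hence the standard errors:

fit_b  <- glm(resp ~ a + b, family = binomial(),      data = dummy_data, weights = w)
fit_qb <- glm(resp ~ a + b, family = quasibinomial(), data = dummy_data, weights = w)
all.equal(coef(fit_b), coef(fit_qb))  # TRUE: same coefficients, different SEs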
If I use the lme function in the package nlme and write
m <- lme(y ~ Time, random = ~1|Subject)
and then write
Variogram(m, form = ~Time|Subject)
it produces the variogram no problem.
However, if I use lm without the random effect,
m <- lm(y ~ Time)
and write
Variogram(m, form = ~Time)
it produces
Error in Variogram.default(m, form = ~Time) :
argument "distance" is missing, with no default
What's going on? Why does it need a distance when I fit an lm, when it didn't need one before with lme?
How then does one plot a variogram without needing to specify distance? I have the same problem with other modelling methods: glm, gam, gamm, etc.
EDIT:
You can verify all of this yourself using e.g. the BodyWeight data in nlme.
> m <- lm(weight ~ Time, data = BodyWeight)
> Variogram(m, form =~Time)
Error in Variogram.default(m, form = ~Time) :
argument "distance" is missing, with no default
In nlme there is a Variogram.lme method for lme fits, which computes the distances itself from the form argument and the model's grouping structure, but there is no equivalent method for lm models, so the call is dispatched to Variogram.default, whose distance argument has no default.
You can use Variogram.default directly by supplying the residuals and a distance vector:
library(nlme)
mod1 <- lm(weight ~ Time, data = BodyWeight)
n <- nrow(BodyWeight)
variog <- Variogram(resid(mod1), distance=dist(1:n))
head(variog)
############
variog dist
1 17.4062805 1
2 23.1229516 2
3 29.6500135 3
4 15.6848617 4
5 3.1222878 5
6 0.9818238 6
We can also plot the variogram:
plot(variog)
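The same pattern carries over to the other model types mentioned in the question, since Variogram.default only needs a vector of residuals and a vector of distances. For instance, continuing from the code above, a sketch with a GAM (assumes mgcv is installed):

library(mgcv)
mod2 <- gam(weight ~ s(Time), data = BodyWeight)
variog2 <- Variogram(resid(mod2), distance = dist(1:n))
plot(variog2)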
I'm trying to do a simulation-based power analysis for a GLMM, like the one described here.
My model is fitted with glmer using a Gamma distribution with a log link:
fit.gamma = glmer(time ~ display + root + technique + order + (1|user),
                  data = Static, family = Gamma(link = "log"))
The simulation of the responses works well when I supply the fitted model object:
simulate(fit.gamma, nsim=1, newdata=expdat)
Result looks good:
sim_1
1 1.8300779
2 12.8543403
3 30.7541107
4 8.2714460
5 162.2040545
But I get a warning and wrong output when, instead of the fitted model, I supply the model formula and parameters:
newparams <- list(beta = fixef(fit.gamma),
                  theta = getME(fit.gamma, "theta"),
                  sigma = getME(fit.gamma, "sigma"))
simulate(~ display + root + technique + order + (1|user),
         nsim = 1,
         family = Gamma(link = "log"),
         newdata = expdat,
         newparams = newparams,
         allow.new.levels = TRUE)
I get the following warning:
Warning message:
In rgamma(nsim * length(ftd), shape = shape, rate = shape/ftd) :
NAs produced
And the response vector looks like this:
sim_1
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
The problem can be reproduced with the following code:
dd <- data.frame(x = rep(seq(-2, 2, length = 15), 10),
                 f = factor(rep(1:10, each = 15)))
dd$y <- simulate(~ x + (1|f), family = Gamma(link = "log"), seed = 101,
                 newdata = dd,
                 newparams = list(beta = c(0, 2), theta = 1, sigma = 1))[[1]]
Working with the formula would be better because I could then manipulate the parameters to simulate various effect sizes.
The simulate function is supposed to work with either a model object or a formula. Am I missing something? Is there an argument that I'm failing to pass to the function?
I'd like to use the ols() (ordinary least squares) function from the rms package to do a multiple linear regression, but I would not like it to calculate the intercept. Using lm(), the syntax would be:
model <- lm(formula = z ~ 0 + x + y, data = myData)
where the 0 stops it from calculating an intercept, and only two coefficients are returned, one for x and the other for y. How do I do this when using ols()?
Trying
model <- ols(formula = z ~ 0 + x + y, data = myData)
did not work; it still returns an intercept and a coefficient each for x and y.
Here is a link to a csv file
It has five columns; this example uses only the first three:
model <- ols(formula = CorrEn ~ intEn_anti_ncp + intEn_par_ncp, data = ccd)
Thanks!
rms::ols uses rms:::Design instead of model.frame.default. Design is called with the default intercept = 1, so there is no (obvious) way to specify that there should be no intercept. I assume there is a good reason for this, but you can try modifying ols with trace().
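For example (a sketch, not a supported workflow: trace() with edit = TRUE opens the function body in an editor so you can change the hard-coded intercept handling by hand, at your own risk):

library(rms)
trace(ols, edit = TRUE)  # hand-edit the body where the Design/intercept is set up
model <- ols(CorrEn ~ intEn_anti_ncp + intEn_par_ncp, data = ccd)
untrace(ols)             # restore the original function afterwards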