I am using the phylolm package to run a model with a binary response variable (0/1), a continuous predictor, and a categorical predictor with more than 3 levels. If I treat the categorical predictor as continuous, i.e., 0, 1, 2, 3, the model runs well and I can use summary(model) to obtain the output. However, treating the categorical levels as continuous does not fit the reality, so I think it is more appropriate to treat them as categories. The model still works that way, but I have trouble with the output: summary(model) gives the results for each category compared to the first. I would like an "anova"-style table summarizing the significance of each variable, but the anova function does not apply to this kind of analysis. Is there any way to obtain such results for this model?
Here is an example script:
require(phylolm)
set.seed(123456)
# Simulate a tree of 50 species
tre = rtree(50)
# Simulate a continuous trait
conTrait = rTrait(n=1,phy=tre)
# Make a design matrix for the binary trait simulation
X = cbind(rep(1,50),conTrait)
# Simulate a binary trait
binTrait = rbinTrait(n=1,phy=tre, beta=c(-1,0.5), alpha=1 ,X=X)
# Simulate a random categorical trait
catTrait <-
as.factor(sample(c("A","B","C"),size=length(tre$tip.label),replace=TRUE))
# Create data frame
dat = data.frame(binTrait = binTrait, conTrait = conTrait, catTrait = catTrait)
### run the model
fit = phyloglm(binTrait ~ conTrait*catTrait, phy=tre, data=dat)
##model output
summary(fit)
Call:
phyloglm(formula = binTrait ~ conTrait * catTrait, data = dat,
phy = tre)
AIC logLik Pen.logLik
52.07 -19.04 -17.28
Method: logistic_MPLE
Mean tip height: 3.596271
Parameter estimate(s):
alpha: 1.437638
Coefficients:
Estimate StdErr z.value p.value
(Intercept) -0.61804 0.83270 -0.7422 0.4580
conTrait 1.52295 1.16256 1.3100 0.1902
catTraitB 0.92563 0.98812 0.9368 0.3489
catTraitC -0.24900 1.01255 -0.2459 0.8057
conTrait:catTraitB 0.49031 1.41858 0.3456 0.7296
conTrait:catTraitC -0.74376 1.29850 -0.5728 0.5668
Note: Wald-type p-values for coefficients, conditional on alpha=1.437638
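There is no anova() method for phyloglm objects, but one pragmatic workaround is an approximate likelihood-ratio test between nested fits, dropping one term at a time. Below is a hedged sketch using the simulated data above; it assumes the fitted object exposes the log-likelihood as fit$logLik (the value printed in the summary), and the chi-squared approximation is rough because the default logistic_MPLE method penalizes the likelihood.
# Full model and a reduced model without the interaction term
fitFull <- phyloglm(binTrait ~ conTrait*catTrait, phy=tre, data=dat)
fitNoInt <- phyloglm(binTrait ~ conTrait + catTrait, phy=tre, data=dat)
# Approximate likelihood-ratio test for the interaction as a whole
LR <- 2*(fitFull$logLik - fitNoInt$logLik)
df_diff <- length(coef(fitFull)) - length(coef(fitNoInt))
pchisq(LR, df = df_diff, lower.tail = FALSE)
Repeating this with a reduced model that also drops catTrait would give an overall test for the categorical predictor, one variable at a time, much like the rows of an anova table.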
I have a dataset demos_mn of demographics and an outcome variable. There are 5 variables of interest, so that my glm and null models looks like this:
# binomial model
res.binom <- glm(var.bool ~ var1 + var2*var3 + var4 + var5,
data = demos_mn, family = "binomial")
# null model
res.null <- glm(var.bool ~ 1,
data = demos_mn, family = "binomial")
# calculate marginal R2 (r.squaredGLMM comes from the MuMIn package)
library(MuMIn)
print(r.squaredGLMM(res.binom))
# show p value
print(anova(res.null, res.binom))
That is my workflow for glm mixed models, but for my binomial model I do not get a p-value for the overall model, only for the predictors. I'm hoping someone could enlighten me.
I did have some success using glmer for a repeated-measures version of the model; however, that unfortunately meant I had to drop some key variables that were not measured repeatedly.
Perhaps you forgot test="Chisq"? From ?anova.glm:
test: a character string, (partially) matching one of ‘"Chisq"’,
‘"LRT"’, ‘"Rao"’, ‘"F"’ or ‘"Cp"’. See ‘stat.anova’.
example("glm") ## to set up / fit the glm.D93 model
null <- update(glm.D93, . ~ 1)
anova(glm.D93, null, test="Chisq")
Analysis of Deviance Table
Model 1: counts ~ outcome + treatment
Model 2: counts ~ 1
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 4 5.1291
2 8 10.5814 -4 -5.4523 0.244
test="Chisq" is poorly named: it's a likelihood ratio test, note it's an asymptotic test [relies on a large sample size]. For GLMs with an adjustable scale parameter (Gaussian, Gamma, quasi-likelihood) you would use test="F".
I am running a lasso regression in glmnet, modeling predictors of a count outcome.
I am wondering what to make of the predictions from this model.
Here is some toy data. It's not very good because I don't know how to simulate multivariate data, but I'm mainly interested in whether I'm getting the syntax right.
set.seed(123)
df <- data.frame(count = rpois(500, lambda = 3),
pred1 = rnorm(500),
pred2 = rnorm(500),
pred3 = rnorm(500),
pred4 = rnorm(500),
pred5 = rnorm(500),
pred6 = rnorm(500),
pred7 = rnorm(500),
pred8 = rnorm(500),
pred9 = rnorm(500),
pred10 = rnorm(500))
Now run the model
library(glmnet)  # for cv.glmnet and its predict method
x <- model.matrix(count ~ ., df)[,-1]
y <- df$count
cvg <- cv.glmnet(x,y,family = "poisson")
Now, when I generate predicted outcomes:
yTest <- predict(cvg, newx = x, family = "poisson", type = "link")
This is the output
# 1 1.094604
# 2 1.094604
# 3 1.094604
# 4 1.094604
# 5 1.094604
# 6 1.094604
# ... ........
Now obviously the model predictions are all the same and all terrible (unsurprising given the absence of any association between the predictors and the outcome), but the thing I am wondering is why they are not integers (with my real data I have the same problem).
So my questions are:
Am I specifying the correct arguments in the predict() function for glmnet? The help for the predict function states that type = "link" gives "the linear predictors" for Poisson models, whereas type = "response" gives the "fitted mean" for Poisson models (in the case of my dumb example it generates 500 values of 2.988).
Shouldn't the predicted outcomes match the form of the data itself, i.e. be integers?
If I am specifying the correct arguments in the predict() function, how do I use the non-integer predictions? Do I round them to the nearest integer, or just leave them alone?
Shouldn't the predicted outcomes match the form of the data itself, i.e. be integers?
When you use a regression model you are associating a (conditional) probability distribution, indexed by parameters (in the Poisson case, the lambda parameter, which represents the mean), with each predictor configuration. A prediction of the response minimizes some expected loss function conditional on the predictor values, so it depends on what loss function you are using.
If you consider a 0-1 loss, then yes, the predicted value should be an integer: the mode of the distribution, its most probable value, which in the case of a Poisson distribution is the floor of lambda when lambda is not an integer (https://en.wikipedia.org/wiki/Poisson_distribution).
If you consider a squared loss (y - y_prediction)^2 then your prediction is the conditional expectation (see https://en.wikipedia.org/wiki/Minimum_mean_square_error#Properties), which is not necessarily an integer, just like the result you are getting.
glmnet returns the fitted conditional mean, i.e. the prediction under squared loss, but you can easily obtain an integer prediction (the one that minimizes the 0-1 loss) by applying the floor() function to the type = "response" predictions that glmnet outputs.
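As a minimal sketch with the toy data above (cvg and x as defined in the question): type = "response" returns the fitted Poisson means, and floor() turns them into integer predictions.
# Fitted means (lambda) on the response scale
lambda_hat <- predict(cvg, newx = x, type = "response")
# Integer predictions: the floor of lambda, i.e. the mode of a Poisson distribution
int_pred <- floor(lambda_hat)
head(int_pred)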
I am analysing routinely collected substance use data from the first 12 months of treatment in a large sample of outpatients attending drug and alcohol treatment services. I am interested in whether differing levels of methamphetamine use (no use, low use, and high use) at the outset of treatment predict different levels after a year in treatment, but the data are very irregular, with different clients measured at different times and a different number of times during their year of treatment.
The data for the high- and low-use groups seem to suggest that drug use declines from its initial level during the first 3 months of treatment and then asymptotes. Hence I thought I would try a nonlinear exponential decay model.
I started with the following nonlinear generalised least squares model using the gnls() function in the nlme package:
library(nlme)  # for gnls(), varExp(), and nlme()
fitExp <- gnls(outcome ~ C*exp(-k*yearsFromStart),
params = list(C ~ atsBase_fac, k ~ atsBase_fac),
data = dfNL,
start = list(C = c(nsC[1], lsC[1], hsC[1]),
k = c(nsC[2], lsC[2], hsC[2])),
weights = varExp(-0.8, form = ~ yearsFromStart),
control = gnlsControl(nlsTol = 0.1))
where outcome is the number of days of drug use in the 28 days prior to measurement, atsBase_fac is a three-level categorical predictor indicating level of amphetamine use at baseline (noUse, lowUse, and highUse), yearsFromStart is a continuous predictor indicating time from start of treatment in years (baseline = 0, max = 1), C is a parameter indicating initial level of drug use, and k is the rate of decay in drug use. The starting values of C and k are taken from nls models estimating these parameters for each group. These are the results of that model:
Generalized nonlinear least squares fit
Model: outcome ~ C * exp(-k * yearsFromStart)
Data: dfNL
AIC BIC logLik
27672.17 27725.29 -13828.08
Variance function:
Structure: Exponential of variance covariate
Formula: ~yearsFromStart
Parameter estimates:
expon
0.7927517
Coefficients:
Value Std.Error t-value p-value
C.(Intercept) 0.130410 0.0411728 3.16738 0.0015
C.atsBase_faclow 3.409828 0.1249553 27.28839 0.0000
C.atsBase_fachigh 20.574833 0.3122500 65.89218 0.0000
k.(Intercept) -1.667870 0.5841222 -2.85534 0.0043
k.atsBase_faclow 2.481850 0.6110666 4.06150 0.0000
k.atsBase_fachigh 9.485155 0.7175471 13.21886 0.0000
So it looks as if there are differences between groups in the initial level of drug use and in the rate of reduction in drug use. I would like to go a step further and fit a nonlinear mixed-effects model. I tried consulting Pinheiro and Bates' book accompanying the nlme package, but the only models I could find that used irregular, sparse data like mine relied on a self-starting function, and my model does not do that.
I tried to adapt the gnls() model to nlme like so:
fitNLME <- nlme(model = outcome ~ C*exp(-k*yearsFromStart),
data = dfNL,
fixed = list(C ~ atsBase_fac, k ~ atsBase_fac),
random = pdDiag(yearsFromStart ~ id),
groups = ~ id,
start = list(fixed = c(nsC[1], lsC[1], hsC[1], nsC[2], lsC[2], hsC[2])),
weights = varExp(-0.8, form = ~ yearsFromStart),
control = nlmeControl(optim = "optimizer"))
but I keep getting error messages, I presume because of errors in the syntax specifying the random effects.
Can anyone give me some tips on how the syntax for the random effects works in nlme?
The only dataset in Pinheiro and Bates that resembled mine used a diagonal variance-covariance matrix. Can anyone fill me in on the syntax of this nlme function, or suggest a better one?
P.S. I wish I could provide a reproducible example, but coming up with synthetic data that re-creates the same errors is way beyond my skills.
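For reference, in Pinheiro and Bates' examples the random effects in nlme() are written in terms of the model parameters (C and k here) rather than the covariate. Below is a sketch in that style, untested against the data in the question (dfNL, id, and the starting values are taken from the code above):
fitNLME <- nlme(model = outcome ~ C*exp(-k*yearsFromStart),
                data = dfNL,
                fixed = list(C ~ atsBase_fac, k ~ atsBase_fac),
                random = pdDiag(C + k ~ 1),  # independent (diagonal) random effects for C and k
                groups = ~ id,
                start = list(fixed = c(nsC[1], lsC[1], hsC[1], nsC[2], lsC[2], hsC[2])),
                weights = varExp(-0.8, form = ~ yearsFromStart),
                control = nlmeControl(pnlsTol = 0.1))  # analogous to nlsTol = 0.1 in the gnls fit
The key change from the gnls() call is that the random effects are attached to C and k per subject (id); pdDiag() makes their variance-covariance matrix diagonal, i.e. the random effects are assumed independent.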
My data consists of survey data on car buyers. It has a weight column that I used in SPSS to get sample sizes; the weights are affected by demographic factors and vehicle sales. Now I am trying to put together a logistic regression model for a car segment that includes a few vehicles. I want to use the weight column in the logistic regression model, and I tried to do so using the "weights" argument of the glm function, but the results are horrific: deviances are too high and McFadden's R-squared is too low. My dependent variable is binary and the independent variables are on a 1-to-5 scale. The weight column is numerical, ranging from 32 to 197. Could that be a reason the results are poor? Do I need the values in the weight column to be below 1?
The format of the input file to R is:
WGT output I1 I2 I3 I4 I5
67 1 1 3 1 5 4
with I1 to I5 being the independent variables.
library(ResourceSelection)  # for hoslem.test
logr <- glm(output ~ 1, data = data1, weights = WGT, family = "binomial")
logrstep <- step(logr, direction = "both", scope = formula(data1))
logr1 <- glm(output ~ (formula from final iteration), weights = WGT, data = data1, family = "binomial")
hl <- hoslem.test(data1$output, fitted(logr1), g = 10)
I want a logistic regression model with better accuracy, and to gain a better understanding of using weights with logistic regression.
I would check out the survey package. This will allow you to specify weights for the survey design using the svydesign function. Additionally, you can use the svyglm function to perform your weighted logistic regression. See http://r-survey.r-forge.r-project.org/survey/
Something like the following, assuming your data is in a data frame called df:
my_svy <- svydesign(data = df, ids = ~1, weights = ~WGT)
Then you can do the following:
my_fit <- svyglm(output ~1, my_svy, family = "binomial")
For a full reprex, check out the example below.
library(survey)
# Generate Some Random Weights
mtcars$wts <- rnorm(nrow(mtcars), 50, 5)
# Make vs a factor just for illustrative purposes
mtcars$vs <- as.factor(mtcars$vs)
# Build the Complete survey Object
svy_df <- svydesign(data = mtcars, ids = ~1, weights = ~wts)
# Fit the logistic regression
fit <- svyglm(vs ~ gear + disp, svy_df, family = "binomial")
# Store the summary object
(fit_sumz <- summary(fit))
# Look at the AIC if desired
AIC(fit)
# Pull out the deviance if desired
fit_sumz$deviance
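Translated to the data in the question, a sketch might look like the following (data1, output, WGT, and I1 through I5 are the names from the question; des and fit_q are hypothetical object names):
# Build the survey design with the question's weight column
des <- svydesign(data = data1, ids = ~1, weights = ~WGT)
# Weighted logistic regression on the five predictors
fit_q <- svyglm(output ~ I1 + I2 + I3 + I4 + I5, design = des, family = "binomial")
summary(fit_q)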
As for the stepwise regression, this typically isn't a great methodology from a statistical point of view. It results in an inflated R2 and some other issues regarding inference (see https://www.stata.com/support/faqs/statistics/stepwise-regression-problems/).
I'm struggling to plot the coefficients of a glm model using abline. Let's take this simple 2D example:
d <- iris[51:150, c(3:4,5)]
d[,3] <- factor(d[,3])
plot(d[,1:2], col=d[,3])
The glm model yields 4 coefficients:
m <- glm(formula = Species~Petal.Length*Petal.Width, data = d, family = "binomial")
m$coefficients
# (Intercept) Petal.Length Petal.Width Petal.Length:Petal.Width
# -131.23813 22.93553 63.63527 -10.63606
How can I plot those with a simple abline?
Binomial models are usually not set up like this. You will usually have a single 0/1 response variable (i.e. predicting whether a sample belongs to a single species). Maybe because you only have 2 species included in your model, it still seems to work (this is not the case when all 3 species are included).
The second trick is to predict with type="response" and round these values to get discrete predictions:
d$pred <- factor(levels(d[,3])[round(predict(m, type="response"))+1])
plot(d[,1:2], col=d[,3])
points(d[,1:2], col=d$pred, pch=4)
Here I've added an "X" for the predictions. If the color is the same, then the prediction was correct. I count 5 samples where the prediction was incorrect.
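To count the misclassified points programmatically rather than by eye (using d$pred as computed above):
# Number of samples where the predicted species differs from the observed one
sum(as.character(d$pred) != as.character(d[, 3]))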