I'm trying to run a GAM on proportional data (numeric between 0 and 1). But I'm getting the warning
In eval(family$initialize) : non-integer #successes in a binomial glm!
Basically I'm modelling the number of occurrences of warm adapted species vs total occurrences of warm and cold adapted species against sea surface temperature and using data from another weather system (NAO) as a random effect, and three other categorical, parametric, variables.
m5 <- gam(prop ~ s(SST_mean) + s(NAO, bs="re") + WarmCold + Cycle6 + Region,
family=binomial, data=DAT_WC, method = "REML")
prop = proportion of occurrences, WarmCold = whether species is warm adapted or cold adapted, Cycle6 = 6 year time period, Region = one of 4 regions. A sample of my dataset is below
structure(list(WarmCold = structure(c(1L, 1L, 1L, 1L, 2L, 2L), .Label = c("Cold",
"Warm"), class = "factor"), Season = structure(c(2L, 2L, 2L,
2L, 2L, 2L), .Label = c("Autumn", "Spring", "Summer", "Winter"
), class = "factor"), Region = structure(c(1L, 2L, 3L, 4L, 1L,
2L), .Label = c("OSPARII_N", "OSPARII_S", "OSPARIII_N", "OSPARIII_S"
), class = "factor"), Cycle6 = structure(c(1L, 1L, 1L, 1L, 1L,
1L), .Label = c("1990-1995", "1996-2001", "2002-2007", "2008-2013",
"2014-2019"), class = "factor"), WC.Strandings = c(18L, 10L,
0L, 3L, 5L, 25L), SST_mean = c(7.4066298185553, 7.49153086390094,
9.28247524767124, 10.8654859624361, 7.4066298185553, 7.49153086390094
), NAO = c(0.542222222222222, 0.542222222222222, 0.542222222222222,
0.542222222222222, 0.542222222222222, 0.542222222222222), AMO = c(-0.119444444444444,
-0.119444444444444, -0.119444444444444, -0.119444444444444, -0.119444444444444,
-0.119444444444444), Total.Strandings = c(23, 35, 5, 49, 23,
35), prop = c(0.782608695652174, 0.285714285714286, 0, 0.0612244897959184,
0.217391304347826, 0.714285714285714)), row.names = c(NA, 6L), class = "data.frame")
From the literature (Zuur, 2009) it seems that a binomial distribution is best suited to proportional data. But it doesn't seem to be working. It runs, but gives the above warning and produces outputs that don't make sense. What am I doing wrong here?
This is a warning, not an error, but it does indicate that something is not quite right: the binomial distribution has support on the non-negative integers, so it doesn't make sense to pass in non-integer values without the sample totals from which the proportions were formed.
You can do this using the weights argument, which in this case should take a vector of integers containing the count total for each observation from which the proportion was computed.
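As a minimal sketch of the weights approach, the snippet below rebuilds the six sample rows from the question (only the columns it needs, with SST rounded) and fits a plain binomial GLM with glm(); gam() accepts a weights argument in the same way. The reduced data frame d and the single-predictor formula are assumptions made to keep the example small; they are not the asker's full model.

```r
# Six sample rows from the question, reduced to the columns used here
d <- data.frame(
  WC.Strandings    = c(18, 10, 0, 3, 5, 25),
  Total.Strandings = c(23, 35, 5, 49, 23, 35),
  SST_mean         = c(7.4066, 7.4915, 9.2825, 10.8655, 7.4066, 7.4915)
)
d$prop <- d$WC.Strandings / d$Total.Strandings

# Proportion as the response, totals as weights:
# prop * weights is an integer count, so the warning is not triggered
fit_w <- glm(prop ~ SST_mean, family = binomial,
             weights = Total.Strandings, data = d)
coef(fit_w)
```

The weights tell the binomial family how many trials each proportion is based on, so a proportion of 0.78 from 23 strandings counts for more than the same proportion from 5.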
Alternatively, consider using family = quasibinomial if the mean-variance relationship is right for your data; the warning will go away, but then you won't be able to use AIC and related tools that expect a real likelihood.
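To illustrate the trade-off, here is a small sketch using glm() on the six sample rows (SST rounded); the data frame d and the one-predictor model are assumptions for illustration only. The quasi family estimates a dispersion parameter, but AIC() returns NA because there is no true likelihood.

```r
# Sample rows from the question, reduced to the columns used here
d <- data.frame(
  Total.Strandings = c(23, 35, 5, 49, 23, 35),
  prop     = c(18/23, 10/35, 0/5, 3/49, 5/23, 25/35),
  SST_mean = c(7.4066, 7.4915, 9.2825, 10.8655, 7.4066, 7.4915)
)

fit_q <- glm(prop ~ SST_mean, family = quasibinomial,
             weights = Total.Strandings, data = d)

summary(fit_q)$dispersion  # estimated dispersion parameter
AIC(fit_q)                 # NA: quasi families have no true likelihood
```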
If your proportions are true proportions, then consider family = betar to fit a beta regression model, where the conditional distribution of the response has support on real values in the unit interval (0, 1). Technically this excludes exactly 0 and 1; mgcv will add or subtract a small number to adjust the data if there are 0 or 1 values in the response.
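A minimal beta regression sketch with mgcv follows. The data are simulated, and the names x, y, mu and phi are hypothetical; the point is only to show the shape of a gam() call with family = betar.

```r
library(mgcv)

set.seed(1)
n   <- 200
x   <- runif(n)
mu  <- plogis(-1 + 2 * x)   # true mean on the (0, 1) scale
phi <- 10                   # precision of the simulated beta response
y   <- rbeta(n, mu * phi, (1 - mu) * phi)

# Beta regression with a smooth of x; the response must lie in (0, 1)
fit_beta <- gam(y ~ s(x), family = betar(link = "logit"), method = "REML")
summary(fit_beta)
```

Because betar has a proper likelihood, AIC-based comparison remains available, unlike with the quasi families.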
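I also found that, rather than calculating a proportion, using cbind() with the two count columns as the response removed the warning. Note that the second column must be the number of failures (total minus successes), not the total itself, e.g.
m8 <- gam(cbind(WC.Strandings, Total.Strandings - WC.Strandings) ~ s(x1) + x2,
family=binomial(link="logit"), data=DAT, method = "REML")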
I'm hoping to get some help with presenting regression outputs for my Masters thesis. I am assessing the impacts of elephants on woody vegetation, particularly in relation to artificial waterholes. In addition to generally declining with distance from waterholes, the impacts differ substantially between the two vegetation types involved.
I've figured out what seems to me a satisfactory way of plotting this using visreg. In the model output shown below, both distance to waterhole and veg type explained damage, hence my attempt to show both. However, the issue is that I only have samples at the furthest distances from waterholes (x-axis) for the red vegetation type. As you can see, the regression line for the blue veg type extends beyond the last points for this vegetation type. Is there any way I can get the blue line to stop at a smaller distance from the waterhole (x-axis value) than the red one to avoid this?
See code for the model and plot below the visreg plot.
Sample data and code
> dput(vegdata[21:52, c(4,7,33)])
structure(list(distance = c(207L, 202L, 501L, 502L, 1001L, 1004L,
2010L, 1997L, 4003L, 3998L, 202L, 194L, 499L, 494L, 1004L, 1000L,
2008L, 1993L, 4008L, 3998L, 493L, 992L, 1941L, 2525L, 485L, 978L,
1941L, 3024L, 495L, 978L, 1977L, 2952L), vegtype = structure(c(1L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("teak",
"term"), class = "factor"), toedl = c(35.48031025, 47.30482718,
25.16709533, 22.29360164, 17.6546533, 12.81605101, 20.34136734,
18.45809334, 11.3578081, 3.490830751, 60.54870317, 44.9863128,
18.81010698, 20.4777188, 30.36994386, 18.7417214, 21.52247156,
18.29685939, 30.26217664, 8.945486104, 43.95749178, 43.54799495,
44.42693993, 50.06207783, 48.05538594, 35.31220933, 52.37339094,
40.51569938, 41.45677007, 58.86629306, 37.80203313, 46.35633342
)), row.names = 21:52, class = "data.frame")
m1<-lm(toedl~vegtype+distance, data=vegdata)
summary(m1)
library(visreg)
visreg(m1, 'distance', by='vegtype', overlay=TRUE, gg=TRUE, points=list(size=2.5), ylab='% old elephant damage', xlab='distance from waterhole')
Regarding the comments about a reproducible example: you can just make a small data frame with representative data, as below. As a general comment, you should also avoid giving your variables the names of base functions, such as 'all'.
I'm not sure whether it's possible to use visreg to do what you want, but you can extract the information from your model using predict, then use ggplot to plot it, which may be preferable because ggplot is really good for customizing plots.
library(ggplot2)
library(visreg)
# Create reproducible data example
allData <- data.frame(vegtype = rep(c("t1", "t2"), each = 10),
oedl = c(seq(from = 35, to = 20, length.out = 10),
seq(from = 20, to = 5, length.out = 10)),
sexactd = c(seq(from = -1, to = 1, length.out = 10),
seq(from = -1, to = 2, length.out = 10)))
# Make linear model
oedl6 <- lm(formula = oedl ~ sexactd + vegtype, data = allData)
# Predict the data using the linear model
odelPred <- cbind(allData, predict(oedl6, interval = 'confidence'))
ggplot(odelPred, aes(sexactd, oedl, color = vegtype, fill = vegtype)) +
geom_point() + geom_line(aes(sexactd, fit)) +
geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.3)
MR Macarthur's solution is great and deserves to be the accepted answer. Visualising a multiple regression model with several predictors in a two-dimensional graph is difficult. Basically, you are limited to one predictor on the x-axis, and you can add the interaction with a grouping variable (in your case, vegtype). One can simply use geom_smooth for this.
Using your data:
library(tidyverse)
ggplot(vegdata, aes(distance, toedl, color = vegtype)) +
geom_point() +
geom_smooth(method = 'lm')
Created on 2019-12-13 by the reprex package (v0.3.0)
I am struggling to examine significant interactions from a linear regression with user-defined contrast codes. I conducted an intervention to improve grades with three treatment groups (i.e., mindset, value, mindset plus value) and one control group. I would now like to see whether there is an interaction between specific intervention levels and various theoretically relevant categorical variables, such as gender, free lunch status, and a binary indicator for having a below-average GPA in the two previous semesters, as well as relevant continuous variables such as pre-intervention beliefs (Imp_pre.C & Val_pre.C in the sample dataframe below). The continuous moderators have been mean centered.
#subset of dataframe
mydata <- structure(list(cond = structure(c(1L, 2L, 3L, 3L, 2L, 3L, 1L,2L, 2L, 1L), contrasts = structure(c(-0.25, 0.0833333333333333,0.0833333333333333, 0.0833333333333333, -1.85037170770859e-17,-0.166666666666667, -0.166666666666667, 0.333333333333333, 0,-0.5, 0.5, 0), .Dim = 4:3, .Dimnames = list(c("control", "mindset","value", "MindsetValue"), NULL)), .Label = c("control", "mindset","value", "MindsetValue"), class = "factor"), sis_mp3_gpa = c(89.0557142857142,91.7514285714285, 94.8975, 87.05875, 69.9928571428571, 78.0357142857142,87.7328571428571, 83.8925, 61.2271428571428, 79.8314285714285), sis_female = c(1L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L), low_gpa_m1m2 = c(0,0, 0, 0, 1, 1, 0, 0, 1, 0), sis_frpl = c(0L, 1L, 0L, 0L, 1L,1L, 1L, 1L, NA, 0L), Imp_pre.C = c(1.80112944983819, -0.198870550161812,1.46112944983819, -0.198870550161812, 0.131129449838188, NA,-0.538870550161812, 0.131129449838188, -0.198870550161812, -0.198870550161812), Val_pre.C = c(-2.2458357581636, -2.0458357581636, 0.554164241836405,0.554164241836405, -0.245835758163595, NA, 0.554164241836405,0.554164241836405, -2.0458357581636, -1.64583575816359)), row.names = c(323L,2141L, 2659L, 2532L, 408L, 179L, 747L, 2030L, 2183L, 733L), class = "data.frame")
Since I'm only interested in specific contrasts, I created the following user defined contrast codes.
mat = matrix(c(1/4, 1/4, 1/4, 1/4, -3, 1, 1, 1, 0, -1, -1, 2, 0,
mymat = solve(t(mat))
mymat
my.contrasts <- mymat[,2:4]
contrasts(mydata$cond) = my.contrasts
I'm testing the following model:
#model
model1 <- lm(sis_mp3_gpa ~ cond +
sis_female*cond +
low_gpa_m1m2*cond +
sis_frpl*cond +
Imp_pre.C*cond +
Val_pre.C*cond +
Imp_pre.C + Val_pre.C +
low_gpa_m1m2 + sis_female +
sis_frpl , data = mydata)
summary(model1)
In the full data set there is a significant interaction between contrast two (comparing mindset plus value to mindset & value) and previous value beliefs (i.e., Val_pre.C) and between this contrast code and having a low gpa the 2 previous semesters. To interpret significant interactions I would like to generate predicted values for individuals one standard deviation below and above the mean on the continuous moderator and the two levels of the categorical moderator. My problem is that when I have tried to do this the plots contain predicted values for each condition rather than collapsing the mindset and value condition and excluding the control condition. Also, when I try to do simple slope analyses I can only get pairwise comparisons for all four conditions rather than the contrasts I'm interested in.
How can I plot predicted values and conduct simple slope analyses for my contrast codes and a categorical and continuous moderator?
This question cannot be answered for two reasons: (1) the code for mat is incomplete and won't run, and (2) the model seriously over-fits the subsetted data provided, resulting in 0 d.f. for error and no way of testing anything.
However, here is an analysis of a much simpler model that may show how to do what is needed.
> simp = lm(sis_mp3_gpa ~ cond * Imp_pre.C, data = mydata)
> library(emmeans)
> emt = emtrends(simp, "cond", var = "Imp_pre.C")
> emt # estimated slopes
cond Imp_pre.C.trend SE df lower.CL upper.CL
control 1.97 7.91 3 -23.2 27.1
mindset 1.37 42.85 3 -135.0 137.7
value 4.72 12.05 3 -33.6 43.1
Confidence level used: 0.95
> # custom contrasts of those slopes
> contrast(emt, list(c1 = c(-1,1,0), c2 = c(-1,-1,2)))
contrast estimate SE df t.ratio p.value
c1 -0.592 43.6 3 -0.014 0.9900
c2 6.104 49.8 3 0.123 0.9102
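For the predicted-value part of the question, one option that needs only base R is to call predict() on a grid that crosses the conditions with the moderator at one SD below and above its mean. The sketch below uses simulated data; toy, gpa and belief are hypothetical stand-ins for the real variables, and averaging the mindset and value predictions is just one possible way to collapse those two groups.

```r
set.seed(42)
lev <- c("control", "mindset", "value", "MindsetValue")
toy <- data.frame(
  cond   = factor(rep(lev, each = 25), levels = lev),
  belief = rnorm(100)
)
toy$belief <- toy$belief - mean(toy$belief)          # mean-centre the moderator
toy$gpa    <- 80 + 2 * toy$belief + rnorm(100, sd = 5)

fit <- lm(gpa ~ cond * belief, data = toy)

# Grid: every condition at -1 SD and +1 SD of the centred moderator
s    <- sd(toy$belief)
grid <- expand.grid(cond = lev, belief = c(-s, s))
grid$pred <- predict(fit, newdata = grid)
grid

# Collapse mindset and value by averaging their predictions at each level
single <- grid[grid$cond %in% c("mindset", "value"), ]
aggregate(pred ~ belief, data = single, FUN = mean)
```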
I have the following long dataset (R dataframe called "dat"). It has around 10,000 observations of 1200 children, with a maximum of 10 observations per child. These observations are weight of the child collected at different time points in early childhood, starting from birth until age 5 years. Age in the dataset is given in days (variable = agedays) and weight in kilograms in variable wtkg.
Dataset
structure(list(subjid = c(1001L, 1001L, 1001L, 1001L), sex = structure(c(2L, 2L, 2L, 2L), .Label = c("Female", "Male"), class = "factor"),
agedays = c(0L, 2L, 30L, 107L), wtkg = c(3.78, 3.64, 4.71,
6.5), Uobservations = c(10L, 10L, 10L, 10L), BMI_group = structure(c(2L,
2L, 2L, 2L), .Label = c("normal", "obese", "overweight",
"true_NA", "underweight"), class = "factor"), GWG_T3_cat2 = structure(c(2L,
2L, 2L, 2L), .Label = c("T3notexcessive", "T3excessive"), class = "factor")), .Names = c("subjid", "sex", "agedays", "wtkg", "Uobservations", "BMI_group", "GWG_T3_cat2"), row.names = c(NA, 4L), class = "data.frame")
I want to use the SITAR package to study the velocity, tempo and size of the children according to excessive gestational weight gain in trimester 3 (versus not excessive weight gain in trimester 3 ) (variable = GWG_T3_cat2) among obese mothers (variable = BMI_group).
I tried running a model as seen in this link: https://www.rdocumentation.org/packages/sitar/versions/1.0.9
m1 <- sitar(x=agedays, y=wtkg, id=subjid, data=dat, df=2)
But I get an error:
Error in nlme.formula(y ~ fitnlme(x, s1, s2, a, b, c), fixed = s1 + s2 +:
Singularity in backsolve at level 0, block 1
I would really appreciate it if anyone could help solve this issue.
I wish to run a linear mixed model on a dependent variable DV that is collected under two levels of Condition at three levels of Timepoint. The data is structured as follows:
## dput(head(RawData,5))
structure(list(Participant = structure(c(2L, 2L, 2L, 2L, 4L),
.Label = c("Jessie", "James", "Gus", "Hudson", "Flossy",
"Bobby", "Thomas", "Alfie", "Charles", "Will", "Mat", "Paul", "Tim",
"John", "Toby", "Blair"), class = "factor"),
xVarCondition = c(1, 1, 0, 0, 1),
Measure = structure(c(1L, 2L, 3L, 4L, 1L),
.Label = c("1", "2", "3", "4", "5", "6", "7", "8",
"9", "10", "11", "12"), class = "factor"),
Sample = structure(c(1L, 2L, 1L, 2L, 1L),
.Label = c("1", "2"), class = "factor"),
Condition = structure(c(2L, 2L, 1L, 1L, 2L),
.Label = c("AM", "PM"), class = "factor"),
Timepoint = structure(c(2L, 2L, 2L, 2L, 1L),
.Label = c("Baseline", "Mid", "Post"), class = "factor"),
DV = c(83.6381348645853, 86.9813802115179, 69.2691666620429,
71.3949807856125, 87.8931998204771)),
.Names = c("Participant", "xVarCondition", "Measure",
"Sample", "Condition", "Timepoint", "DV"),
row.names = c(NA, 5L), class = "data.frame")
Each Participant performs two trials per Condition across three Timepoints, as indexed by Measure; however, there are missing data, so there are not necessarily 12 levels per participant. The column xVarCondition is simply a dummy variable that contains a 1 for each entry of PM in Condition (as in the sample above). The column Sample refers to the 2 trials for each Condition at each Timepoint.
I am an R user but the statistician is a SAS user who believes the code for the model should be:
proc mixed data=RawData covtest cl alpha=&alpha;
class Participant Condition Timepoint Measure Sample;
model &dep=Condition Timepoint/s ddfm=sat outp=pred residual noint;
random int xVarCondition xVarCondition*TimePoint*Sample
TimePoint/subject=Participant s;
The above SAS code gives sensible answers and is working perfectly. We believe the resulting lme4 syntax for the above model to be:
TestModel = lmer(DV ~ Condition + Timepoint +
(1 | Participant/Timepoint) +
(0 + xVarCondition | Participant) +
(1 | Participant:xVarCondition:Measure), data = RawData)
However, I get the following error when running this model:
Error: number of levels of each grouping factor must be < number of observations
Are the random effects specified correctly?
I can't quite tell from your description, but most likely your Participant:xVarCondition:Measure term constructs a grouping variable that has no more than one observation in each level, which makes the (1|Participant:xVarCondition:Measure) term redundant with the residual error term that is always included in an lmer model. You can override the error if you really want to by including
control=lmerControl(check.nobs.vs.nlev = "ignore")
in your function call, but (if I've diagnosed the problem correctly) this will lead to the residual variance and the Participant:xVarCondition:Measure variance being jointly unidentifiable. Such unidentifiability usually doesn't cause any problems with the rest of the model, but I am more comfortable with an identifiable model (there's always the possibility that such unidentifiability will lead to numerical issues).
There's a similar example here.
You can check my conjecture as follows:
ifac <- with(RawData,
interaction(Participant,xVarCondition,Measure,drop=TRUE))
length(levels(ifac)) == nrow(RawData)
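To see why this check comes out TRUE under the conjecture, here is a self-contained toy version. The design below (4 participants, 12 measures, one row per pair, and a dummy fully determined by Measure) is a hypothetical stand-in for RawData, not the asker's data.

```r
# Hypothetical toy design: one row per Participant x Measure combination
toy <- expand.grid(Participant = factor(paste0("P", 1:4)),
                   Measure     = factor(1:12))

# Dummy that is fully determined by Measure (alternating blocks of two trials)
toy$xVarCondition <- as.numeric(((as.integer(toy$Measure) - 1) %/% 2) %% 2 == 0)

ifac <- with(toy, interaction(Participant, xVarCondition, Measure, drop = TRUE))
nlevels(ifac) == nrow(toy)   # TRUE: as many levels as observations
max(table(ifac))             # 1: no level has more than one observation
```

With one observation per level, a random intercept for this grouping factor plays the same role as the residual, which is the redundancy described above.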
I am running a GLMM using glmer() in R:
glmer(survive ~ fyear + site + fyear * site.x + (1|fyear),
family = binomial(link = logexp(shaffer.sub$exposure)),
data = shaffer.sub)
where survive is 0 or 1 depending if the nest was successful or not. Here you can see what the data looks like:
structure(list(id = structure(1:7, .Label = c("1", "2", "3",
"4", "5", "6", "7"), class = "factor"), year.x = structure(c(1L,
1L, 2L, 3L, 3L, 3L, 3L), .Label = c("1994", "1995", "1999"), class = "factor"),
survive = structure(c(1L, 2L, 2L, 2L, 2L, 2L, 1L), .Label = c("0",
"1"), class = "factor"), fyear = structure(c(1L, 1L, 2L,
3L, 3L, 3L, 3L), .Label = c("1994", "1995", "1999"), class = "factor"),
site.x = structure(c(1L, 2L, 1L, 1L, 1L, 2L, 1L), .Label = c("N",
"S"), class = "factor")), .Names = c("id", "year.x", "survive",
"fyear", "site.x"), row.names = c(NA, -7L), class = "data.frame")
but I get this warning message:
Warning messages:
1: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model failed to converge with max|grad| = 0.0299425 (tol = 0.001, component 12)
2: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model is nearly unidentifiable: large eigenvalue ratio
- Rescale variables?
I was told I should not use the same random factor as a fixed effect on the same model.
In the end, I would like an output where I can see the year, site and year:site interaction effects, like in an ANOVA table (is this possible?). I've been trying to use summary(aov(model)) but this doesn't work; anova(model) doesn't either.
I get this error for the aov() command:
Error in summary(aov(syearXsite)) :
error in evaluating the argument 'object' in selecting a method for function 'summary': Error in if (fixed.only) { : argument is not interpretable as logical
How can I see the effect of these variables on survival?
Whoever told you not to use a categorical input variable (fyear) as both a fixed and a random effect was correct. It's hard to know exactly what to recommend; it depends on the number of years and sites you have in your data set. Is the data you linked to all of your data (I hope not), or just the first few rows? How many years, how many sites, and how many total observations do you have?
If you want to treat year as random and site as fixed (which would be sensible if you have only two sites (N vs S as seen in your data) and quite a few years, e.g. more than 5) then you could fit:
g1 <- glmer(survive~site.x+(site.x|fyear),
family=binomial(link=logexp(shaffer.sub$exposure)),
data=shaffer.sub)
I don't know what site vs site.x are: I only see site.x in your data snippet.
To get information, try summary(g1). (That will only give you variances for random effects, not for fixed effects; GLMMs don't operate in the same "variance explained" mode as ANOVA does, in particular because the variances explained by different terms usually do not add up to the total variance.)