Probe interactions from regression with user defined contrasts - r

I am struggling to probe significant interactions from a linear regression with user-defined contrast codes. I conducted an intervention to improve grades with three treatment groups (mindset, value, and mindset plus value) and one control group. I would now like to see whether there is an interaction between specific intervention contrasts and several theoretically relevant moderators: categorical variables such as gender, free lunch status, and a binary indicator for having a below-average GPA in the two previous semesters, and continuous variables such as pre-intervention beliefs (Imp_pre.C and Val_pre.C in the sample data frame below). The continuous moderators have been mean-centered.
#subset of dataframe
mydata <- structure(list(cond = structure(c(1L, 2L, 3L, 3L, 2L, 3L, 1L,2L, 2L, 1L), contrasts = structure(c(-0.25, 0.0833333333333333,0.0833333333333333, 0.0833333333333333, -1.85037170770859e-17,-0.166666666666667, -0.166666666666667, 0.333333333333333, 0,-0.5, 0.5, 0), .Dim = 4:3, .Dimnames = list(c("control", "mindset","value", "MindsetValue"), NULL)), .Label = c("control", "mindset","value", "MindsetValue"), class = "factor"), sis_mp3_gpa = c(89.0557142857142,91.7514285714285, 94.8975, 87.05875, 69.9928571428571, 78.0357142857142,87.7328571428571, 83.8925, 61.2271428571428, 79.8314285714285), sis_female = c(1L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L), low_gpa_m1m2 = c(0,0, 0, 0, 1, 1, 0, 0, 1, 0), sis_frpl = c(0L, 1L, 0L, 0L, 1L,1L, 1L, 1L, NA, 0L), Imp_pre.C = c(1.80112944983819, -0.198870550161812,1.46112944983819, -0.198870550161812, 0.131129449838188, NA,-0.538870550161812, 0.131129449838188, -0.198870550161812, -0.198870550161812), Val_pre.C = c(-2.2458357581636, -2.0458357581636, 0.554164241836405,0.554164241836405, -0.245835758163595, NA, 0.554164241836405,0.554164241836405, -2.0458357581636, -1.64583575816359)), row.names = c(323L,2141L, 2659L, 2532L, 408L, 179L, 747L, 2030L, 2183L, 733L), class = "data.frame")
Since I'm only interested in specific contrasts, I created the following user defined contrast codes.
mat = matrix(c(1/4, 1/4, 1/4, 1/4, -3, 1, 1, 1, 0, -1, -1, 2, 0,
mymat = solve(t(mat))
mymat
my.contrasts <- mymat[,2:4]
contrasts(mydata$cond) = my.contrasts
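As a minimal sketch of the general technique used above (inverting a matrix of contrast weights to obtain the coding matrix), here is a self-contained three-group example. It does not reconstruct the truncated `mat`; the groups, the weights in `L`, and the simulated outcome `y` are all hypothetical. It shows that, with this coding, each regression coefficient equals the corresponding weighted combination of group means.

```r
# Hypothetical three-group design with simulated data
set.seed(123)
g <- factor(rep(c("control", "A", "B"), each = 20),
            levels = c("control", "A", "B"))
y <- rnorm(60, mean = c(70, 75, 78)[as.integer(g)], sd = 5)

# Rows are the hypotheses of interest (hypothetical weights):
# grand mean, treated vs. control, B vs. A
L <- rbind(intercept   = c(1/3, 1/3, 1/3),
           trt_vs_ctrl = c(-1, 1/2, 1/2),
           B_vs_A      = c(0, -1, 1))

# Invert the weight matrix; drop the intercept column to get the codes
C <- solve(L)
contrasts(g) <- C[, -1]

fit <- lm(y ~ g)

# Each coefficient reproduces one row of L applied to the group means
means <- as.numeric(tapply(y, g, mean))
cbind(coef = coef(fit), from_means = as.vector(L %*% means))
```

The two columns printed at the end should agree, which is a quick way to confirm that a set of user-defined codes tests what was intended.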
I'm testing the following model:
#model
model1 <- lm(sis_mp3_gpa ~ cond +
sis_female*cond +
low_gpa_m1m2*cond +
sis_frpl*cond +
Imp_pre.C*cond +
Val_pre.C*cond +
Imp_pre.C + Val_pre.C +
low_gpa_m1m2 + sis_female +
sis_frpl , data = mydata)
summary(model1)
In the full data set there is a significant interaction between contrast two (comparing mindset plus value to mindset & value) and previous value beliefs (i.e., Val_pre.C) and between this contrast code and having a low gpa the 2 previous semesters. To interpret significant interactions I would like to generate predicted values for individuals one standard deviation below and above the mean on the continuous moderator and the two levels of the categorical moderator. My problem is that when I have tried to do this the plots contain predicted values for each condition rather than collapsing the mindset and value condition and excluding the control condition. Also, when I try to do simple slope analyses I can only get pairwise comparisons for all four conditions rather than the contrasts I'm interested in.
How can I plot predicted values and conduct simple slope analyses for my contrast codes and a categorical and continuous moderator?

This question cannot be answered for two reasons: (1) the code for mat is incomplete and won't run, and (2) the model seriously over-fits the subsetted data provided, resulting in 0 d.f. for error and no way of testing anything.
However, here is an analysis of a much simpler model that may show how to do what is needed.
> simp = lm(sis_mp3_gpa ~ cond * Imp_pre.C, data = mydata)
> library(emmeans)
> emt = emtrends(simp, "cond", var = "Imp_pre.C")
> emt # estimated slopes
 cond    Imp_pre.C.trend    SE df lower.CL upper.CL
 control            1.97  7.91  3    -23.2     27.1
 mindset            1.37 42.85  3   -135.0    137.7
 value              4.72 12.05  3    -33.6     43.1

Confidence level used: 0.95
> # custom contrasts of those slopes
> contrast(emt, list(c1 = c(-1,1,0), c2 = c(-1,-1,2)))
 contrast estimate   SE df t.ratio p.value
 c1         -0.592 43.6  3  -0.014  0.9900
 c2          6.104 49.8  3   0.123  0.9102
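For predicted values at one standard deviation below and above the mean of a continuous moderator, the same custom-contrast idea can be sketched in base R. This is only an illustration under stated assumptions: the data are simulated, the names `d`, `z`, `probe`, and `w` are hypothetical, and the model uses R's default treatment coding (a contrast of predicted means does not depend on the coding). The weights compare the combined condition with the average of mindset and value and give the control group a weight of zero.

```r
# Simulated data with a hypothetical continuous moderator z
set.seed(1)
lev <- c("control", "mindset", "value", "MindsetValue")
d <- data.frame(cond = factor(rep(lev, each = 30), levels = lev),
                z = rnorm(120))
d$y <- 80 + 2 * (d$cond == "MindsetValue") +
  1.5 * d$z * (d$cond == "MindsetValue") + rnorm(120, sd = 3)

fit <- lm(y ~ cond * z, data = d)

# Contrast of the four predicted condition means at moderator value z0
probe <- function(fit, z0, w) {
  nd <- data.frame(cond = factor(lev, levels = lev), z = z0)
  X <- model.matrix(delete.response(terms(fit)), nd)
  k <- drop(w %*% X)                       # weights on the coefficients
  est <- sum(k * coef(fit))
  se <- sqrt(drop(k %*% vcov(fit) %*% k))
  c(estimate = est, SE = se, t.ratio = est / se)
}

# MindsetValue vs. the average of mindset and value; control excluded
w <- c(0, -1/2, -1/2, 1)
res <- rbind(low  = probe(fit, mean(d$z) - sd(d$z), w),
             high = probe(fit, mean(d$z) + sd(d$z), w))
res
```

The t ratios can be referred to a t distribution with `df.residual(fit)` degrees of freedom. For a binary moderator the same function applies with `z0` set to 0 and 1.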

Related

How to deal with highly skewed residuals in GLMM

I am trying to see if there is a relation between the proportion of target vs non target species in some communities and their phylogenetic structure. I cannot show you the real data but it looks something similar to this (although I have over 18000 data points and 'hab' has 12 levels):
df <- structure(list(hab = structure(c(6L, 6L, 6L, 6L, 6L, 6L, 6L,12L, 12L, 9L),
.Label = c("Eur_nitro_herb", "Forest_deciduous","Forest_everg_Eur",
"Forest_med", "Grass_alp", "Grass_mont_subalp","Margin_mantle", "Med_nitro_herb", "Rocks", "Shrub_med", "Shrub_mont_subalp","Wet_humid"), class = "factor"),
ses_mpd = c(0.05408747785078,-0.578266990137644, -1.48812316684822, -0.345401572814568, 0.124151290090708,-1.51817069020564, 0.0530607986221243, 0.00261416940904258, 0.665908557766837,-0.701477005797007),
target = c(1, 2, 3, 1, 2, 0, 0, 0, 0, 1),non_target = c(32, 27, 20, 30, 34, 26, 30, 9, 12, 6)), row.names = c(1L,2L, 3L, 4L, 5L, 6L, 7L, 18793L, 18794L, 18795L), class = "data.frame")
df
ses.mpd, as calculated by the picante package, is a standardized effect size, so it can take positive and negative values as well as zero.
I want to see the relationship between proportion of target and non target with ses_mpd controlling for possible differences between habitats. I have used a Generalized Linear Mixed Model to do so but when I check the residuals they look fairly skewed:
library(lme4)
mod<-glmer(cbind(target,non_target)~ses_mpd+(1|hab), family = binomial(logit), data = df)
plot(mod, resid(., type='response')~fitted(.), main="Normalized Residuals v Fitted Values",abline=c(0,0))
res <- resid(mod, type="response")
qqnorm(res)
qqline(res)
Given the large sample size I was expecting the residuals to be normal, but I was clearly wrong. I guess that these results cannot be trusted, so my question is whether there is any other way to analyze these data.
Cheers and thanks in advance.

Error running binomial GAM in mgcv with proportional data

I'm trying to run a GAM on proportional data (numeric between 0 and 1). But I'm getting the warning
In eval(family$initialize) : non-integer #successes in a binomial glm!
Basically I'm modelling the number of occurrences of warm adapted species vs total occurrences of warm and cold adapted species against sea surface temperature and using data from another weather system (NAO) as a random effect, and three other categorical, parametric, variables.
m5 <- gam(prop ~ s(SST_mean) + s(NAO, bs="re") + WarmCold + Cycle6 + Region,
family=binomial, data=DAT_WC, method = "REML")
prop = proportion of occurrences, WarmCold = whether species is warm adapted or cold adapted, Cycle6 = 6 year time period, Region = one of 4 regions. A sample of my dataset is below
structure(list(WarmCold = structure(c(1L, 1L, 1L, 1L, 2L, 2L), .Label = c("Cold",
"Warm"), class = "factor"), Season = structure(c(2L, 2L, 2L,
2L, 2L, 2L), .Label = c("Autumn", "Spring", "Summer", "Winter"
), class = "factor"), Region = structure(c(1L, 2L, 3L, 4L, 1L,
2L), .Label = c("OSPARII_N", "OSPARII_S", "OSPARIII_N", "OSPARIII_S"
), class = "factor"), Cycle6 = structure(c(1L, 1L, 1L, 1L, 1L,
1L), .Label = c("1990-1995", "1996-2001", "2002-2007", "2008-2013",
"2014-2019"), class = "factor"), WC.Strandings = c(18L, 10L,
0L, 3L, 5L, 25L), SST_mean = c(7.4066298185553, 7.49153086390094,
9.28247524767124, 10.8654859624361, 7.4066298185553, 7.49153086390094
), NAO = c(0.542222222222222, 0.542222222222222, 0.542222222222222,
0.542222222222222, 0.542222222222222, 0.542222222222222), AMO = c(-0.119444444444444,
-0.119444444444444, -0.119444444444444, -0.119444444444444, -0.119444444444444,
-0.119444444444444), Total.Strandings = c(23, 35, 5, 49, 23,
35), prop = c(0.782608695652174, 0.285714285714286, 0, 0.0612244897959184,
0.217391304347826, 0.714285714285714)), row.names = c(NA, 6L), class = "data.frame")
From the literature (Zuur, 2009) it seems that a binomial distribution is the best choice for proportional data, but it doesn't seem to be working. The model runs but gives the above warning and produces outputs that don't make sense. What am I doing wrong here?
This is a warning, not an error, but it does indicate that something is not quite right: the binomial distribution has support on the non-negative integers, so it doesn't make sense to pass in non-integer proportions without the sample totals from which they were formed.
You can do this using the weights argument, which in this case should take a vector of integers containing the count total for each observation from which the proportion was computed.
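A minimal sketch of the weights approach, using simulated data (the data frame `dat` and its columns `x`, `total`, `succ`, and `prop` are hypothetical, not the poster's variables). It also fits the equivalent two-column form, with successes and failures as the columns, to show that both specifications give the same fit.

```r
library(mgcv)

# Simulated counts: 'succ' successes out of 'total' trials per observation
set.seed(1)
n <- 200
dat <- data.frame(x = runif(n, 5, 12),
                  total = sample(5:50, n, replace = TRUE))
dat$succ <- rbinom(n, dat$total, plogis(-3 + 0.35 * dat$x))
dat$prop <- dat$succ / dat$total

# Proportion as response, with the trial totals supplied as weights
m_w <- gam(prop ~ s(x), family = binomial, weights = total,
           data = dat, method = "REML")

# Equivalent form: two-column response of successes and failures
m_c <- gam(cbind(succ, total - succ) ~ s(x), family = binomial,
           data = dat, method = "REML")

# The two fits should have the same coefficients
all.equal(coef(m_w), coef(m_c), tolerance = 1e-6)
```

Because the proportions multiplied by the totals are whole numbers, neither call should trigger the non-integer warning.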
Alternatively, consider using family = quasibinomial if the mean-variance relationship is right for your data; the warning will go away, but you will then not be able to use AIC and related tools that expect a real likelihood.
If your proportions are true proportions, then consider family = betar to fit a beta regression model, where the conditional distribution of the response has support on real values in the unit interval (0, 1) (technically excluding 0 and 1; mgcv will add or subtract a small number to adjust the data if there are 0 or 1 values in the response).
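I also found that, rather than supplying the total through weights, using cbind() with the two count columns removed the warning. Note that the two columns must be successes and failures (total minus successes), not successes and the total, e.g.
m8 <- gam(cbind(WC.Strandings, Total.Strandings - WC.Strandings) ~ s(x1) + x2,
family=binomial(link="logit"), data=DAT, method = "REML")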

Specify end points for different groups when plotting regression output in R

I'm hoping to get some help with presenting regression outputs for my Masters thesis. I am assessing the impacts of elephants on woody vegetation, particularly in relation to artificial waterholes. In addition to generally declining with distance from waterholes, the impacts differ substantially between the two vegetation types involved.
I've figured out what seems to me a satisfactory way of plotting this using visreg. In the model output shown below, both distance to waterhole and veg type explained damage, hence my attempt to show both. However, the issue is that the samples at the furthest distances from waterholes (x-axis) come only from the red vegetation type. As you can see, the regression line for the blue veg type extends beyond the last points for this vegetation type. Is there any way I can get the blue line to stop at a smaller distance from the waterhole (x-axis value) than the red line to avoid this?
See code for the model and plot below the visreg plot.
Sample data and code
> dput(vegdata[21:52, c(4,7,33)])
structure(list(distance = c(207L, 202L, 501L, 502L, 1001L, 1004L,
2010L, 1997L, 4003L, 3998L, 202L, 194L, 499L, 494L, 1004L, 1000L,
2008L, 1993L, 4008L, 3998L, 493L, 992L, 1941L, 2525L, 485L, 978L,
1941L, 3024L, 495L, 978L, 1977L, 2952L), vegtype = structure(c(1L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("teak",
"term"), class = "factor"), toedl = c(35.48031025, 47.30482718,
25.16709533, 22.29360164, 17.6546533, 12.81605101, 20.34136734,
18.45809334, 11.3578081, 3.490830751, 60.54870317, 44.9863128,
18.81010698, 20.4777188, 30.36994386, 18.7417214, 21.52247156,
18.29685939, 30.26217664, 8.945486104, 43.95749178, 43.54799495,
44.42693993, 50.06207783, 48.05538594, 35.31220933, 52.37339094,
40.51569938, 41.45677007, 58.86629306, 37.80203313, 46.35633342
)), row.names = 21:52, class = "data.frame")
m1<-lm(toedl~vegtype+distance, data=vegdata)
summary(m1)
library(visreg)
visreg(m1, 'distance', by='vegtype', overlay=TRUE, gg=TRUE, points=list(size=2.5), ylab='% old elephant damage', xlab='distance from waterhole')
Regarding the comments about a reproducible example: you can just make a small data frame with representative data, as below. As a general comment, you should also avoid naming your variables after base functions such as 'all'.
I'm not sure whether it's possible to use visreg to do what you want, but you can extract the information from your model using predict, then use ggplot to plot it, which may be preferable because ggplot is really good for customizing plots.
library(ggplot2)
library(visreg)
# Create reproducible data example
allData <- data.frame(vegtype = rep(c("t1", "t2"), each = 10),
oedl = c(seq(from = 35, to = 20, length.out = 10),
seq(from = 20, to = 5, length.out = 10)),
sexactd = c(seq(from = -1, to = 1, length.out = 10),
seq(from = -1, to = 2, length.out = 10)))
# Make linear model
oedl6 <- lm(formula = oedl ~ sexactd + vegtype, data = allData)
# Predict the data using the linear model
odelPred <- cbind(allData, predict(oedl6, interval = 'confidence'))
ggplot(odelPred, aes(sexactd, oedl, color = vegtype, fill = vegtype)) +
geom_point() + geom_line(aes(sexactd, fit)) +
geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.3)
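If a smooth line over a fine grid is preferred to predictions at the observed points only, one option is to build the prediction grid separately for each vegetation type, limited to that type's observed range. The sketch below uses simulated data; the data frame `veg`, the column `damage`, and the grid size of 50 are hypothetical choices.

```r
# Simulated data: "term" is only sampled up to 3000 m, "teak" up to 4000 m
set.seed(42)
veg <- data.frame(
  vegtype  = rep(c("teak", "term"), each = 10),
  distance = c(seq(200, 4000, length.out = 10),
               seq(200, 3000, length.out = 10))
)
veg$damage <- 40 - 0.005 * veg$distance +
  ifelse(veg$vegtype == "term", 15, 0) + rnorm(20, sd = 3)

fit <- lm(damage ~ vegtype + distance, data = veg)

# One grid per vegetation type, spanning only its observed distances
ranges <- split(veg$distance, veg$vegtype)
grid <- do.call(rbind, lapply(names(ranges), function(g) {
  data.frame(vegtype = g,
             distance = seq(min(ranges[[g]]), max(ranges[[g]]),
                            length.out = 50))
}))

# Fitted values and confidence limits on the restricted grid
pred <- cbind(grid, predict(fit, newdata = grid, interval = "confidence"))
tapply(pred$distance, pred$vegtype, max)
```

The resulting `pred` can be passed to geom_line() and geom_ribbon() in the same way as the predictions above, and each line then stops at the last sampled distance for its vegetation type.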
MR Macarthur's solution is great and deserves to be the accepted answer. Visualising a multiple regression model with several predictors in a two-dimensional graph is difficult. Basically, you are limited to one predictor on the x-axis, and you can add the interaction with a grouping variable (in your case, vegtype). One can simply use geom_smooth for this; note that it fits a separate regression line for each group, each drawn only over that group's observed range.
Using your data:
library(tidyverse)
ggplot(vegdata, aes(distance, toedl, color = vegtype)) +
geom_point() +
geom_smooth(method = 'lm')
Created on 2019-12-13 by the reprex package (v0.3.0)

Multiple Regression - Error in model.frame.default variable lengths differ

I'm trying to run a multiple regression with 3 independent variables, and 3 dependent variables. The question is based on how water quality influences plankton abundance in and between 3 different locations aka guzzlers. With water quality variables being pH, phosphates, and nitrates. Dependent/response variables would be the plankton abundance in each 3 locations.
Here is my code:
model1 <- lm(cbind(Abundance[Guzzler.. == 1], Abundance[Guzzler.. == 2],
Abundance[Guzzler.. == 3]) ~ Phospates + Nitrates + pH,
data=WQAbundancebyGuzzler)
And this is the error message I am getting:
Error in model.frame.default(formula = cbind(Abundance[Guzzler.. == 1], :
variable lengths differ (found for 'Phospates')
I think it has to do with how my data is set up but I'm not sure how to go about changing this to get the model to run. What I'm trying to see is how these water quality variables are affecting the abundance in the different locations and how they vary between. So it doesn't seem quite logical to try multiple models which was my only other thought.
Here is the output from dput(head(WQAbundancebyGuzzler)):
structure(list(ï..Date = structure(c(2L, 4L, 1L, 3L, 5L, 2L), .Label = c("11/16/2018",
"11/2/2018", "11/30/2018", "11/9/2018", "12/7/2018"), class = "factor"),
Guzzler.. = c(1L, 1L, 1L, 1L, 1L, 2L), Phospates = c(2L,
2L, 2L, 2L, 2L, 1L), Nitrates = c(0, 0.3, 0, 0.15, 0, 0),
pH = c(7.5, 8, 7.5, 7, 7, 8), Air.Temp..C. = c(20.8, 25.4,
20.9, 16.8, 19.4, 27.4), Relative.Humidity... = c(62L, 31L,
41L, 59L, 59L, 43L), DO2.Concentration..mg.L. = c(3.61, 4.48,
3.57, 5.65, 2.45, 5.86), Water.Temp..C. = c(14.1, 11.5, 11.8,
13.9, 11.1, 17.8), Abundance = c(98L, 43L, 65L, 55L, 54L,
29L)), .Names = c("ï..Date", "Guzzler..", "Phospates", "Nitrates",
"pH", "Air.Temp..C.", "Relative.Humidity...", "DO2.Concentration..mg.L.",
"Water.Temp..C.", "Abundance"), row.names = c(NA, 6L), class = "data.frame")
I think the problem here is more theoretical. You say that you have three dependent variables that you want to enter into a multiple linear regression. However, at least in classic linear regression, there can only be one dependent variable. There might be ways around this, but I think in your case one dependent variable works just fine: it's 'Abundance'. You have sampled three different locations, and one way to account for this is to enter the location as a categorical independent variable. So I would propose the following model:
# Make sure that Guzzler is not treated as numeric
WQAbundancebyGuzzler$Guzzler <- as.factor(WQAbundancebyGuzzler$Guzzler)
# Model with 4 independent variables
model1 <- lm(Abundance ~ Guzzler + Phospates + Nitrates + pH,
data=WQAbundancebyGuzzler)
It's probably also wise to think about possible interactions here, especially between Guzzler and the other independent variables.
The reason for your error is that you subset only "Abundance" but not the other variables, so their lengths differ. You need to subset the whole data frame instead, e.g. for the first location:
lm(Abundance ~ Phospates + Nitrates + pH,
data=WQAbundancebyGuzzler[WQAbundancebyGuzzler$Guzzler.. == 1, ])
The same row-subsetting pattern works with any column. For illustration, with the given head(WQAbundancebyGuzzler), keeping only the rows whose Abundance is 29, 43 or 65
lm(Abundance ~ Phospates + Nitrates + pH,
data=WQAbundancebyGuzzler[WQAbundancebyGuzzler$Abundance %in% c(29, 43, 65), ])
results in the following (pH is NA because three rows cannot identify four coefficients)
# Call:
# lm(formula = Abundance ~ Phospates + Nitrates + pH, data = WQAbundancebyGuzzler
# [WQAbundancebyGuzzler$Abundance %in%
# c(29, 43, 65), ])
#
# Coefficients:
# (Intercept) Phospates Nitrates pH
# -7.00 36.00 -73.33 NA
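The error and both fixes can be reproduced on a small hypothetical data frame (the names `d`, `site`, `x`, and `y` and all values are invented for illustration):

```r
# Hypothetical data: 3 sites, 4 samples each
d <- data.frame(
  site = rep(1:3, each = 4),
  x = c(7.5, 8.0, 7.0, 7.2, 8.0, 7.6, 7.1, 7.9, 7.3, 7.7, 8.1, 7.4),
  y = c(98, 43, 65, 55, 54, 29, 61, 47, 72, 38, 50, 66)
)

# Subsetting only the response leaves 4 values of y against 12 values of x
msg <- tryCatch(lm(y[site == 1] ~ x, data = d),
                error = function(e) conditionMessage(e))
msg

# Fix 1: subset the rows of the whole data frame
fit_site1 <- lm(y ~ x, data = d[d$site == 1, ])

# Fix 2: keep all rows and model site as a categorical predictor
fit_all <- lm(y ~ factor(site) + x, data = d)
coef(fit_all)
```

The second form keeps a single response and lets the site effect be estimated within one model, which matches the suggestion to treat location as a categorical independent variable.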

Model fit failed in Caret GBM and all the RMSE metric values are missing:

I have encountered the errors indicated in the title. I looked through the solutions posted online for similar problems and tried checking for null values, changing the predictor variables to numeric, and pre-processing the variables with center and scale, but to no avail.
I am able to run the model using the same data in Caret for RF as well as with a range of tunegrid options for each GBM parameter but not when I specify the optimal value for each GBM parameter.
My train data comprises a regression target variable (Gross.Salary0) and my predictor variables are either factor(binary) or numeric. There are no missing values in my data. A subset of data without the full number of variables is as follows:
structure(list(Gross.Salary0 = c(3043.7, 4170, 3148.4, 3678.4, 3586.4,
3126.4), Gender.MALE. = structure(c(1L, 2L, 1L, 1L, 2L, 1L), .Label =
c("0", "1"), class = "factor"), Certificate...HQA.MASTER.S.DEGREE....Outflow.Date..2.
= c(0L, 1875929344L, 0L, 1706185636L, 0L, 0L), Certificate...HQA.HONS.I.
= structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("0", "1"), class =
"factor"), Year.of.Inflow... = c(2009L, 2009L, 2009L, 2009L, 2009L,
2009L), Year.of.Inflow..2. = c(4036081L, 4036081L, 4036081L, 4036081L,
4036081L, 4036081L), Age...5....Agency.10. = c(0, 0, 0, 0, 0, 0),
Inflow.Date..2....Agency.10. = c(0L, 0L, 0L, 0L, 0L, 0L)), row.names =
c(NA, 6L), class = "data.frame")
To obtain the optimal tuning parameters for Caret GBM in R, I manage to run the following code:
fit_control2 <- trainControl(method="repeatedcv", number=10, repeats=3, search="grid")
grid <- expand.grid(n.trees=c(10,20,50,100,500,1000),shrinkage=c(0.01,0.05,0.1,0.5),n.minobsinnode = c(3,5,10),interaction.depth=c(1,5,10))
gbm_model2 <-train(Gross.Salary0 ~ ., data=train, method='gbm',trControl=fit_control2, tuneGrid=grid)
gbm_model2
The results produce the lowest RMSE when n.trees = 1000, interaction.depth = 1, shrinkage = 0.05 and n.minobsinnode = 3.
I run the final GBM model using the optimal tuning parameters but return with a model fit error and all missing RMSE values.
fit_control3 <- trainControl(method="repeatedcv", number=10, repeats=3, search="grid")
tunegrid <- expand.grid(n.trees=1000, interaction.depth = 1, shrinkage = 0.05, n.minobsinnode = 3)
gbm_model3 <- train(Gross.Salary0 ~ ., data=train, method="gbm", tunegrid =tunegrid, trControl=fit_control3)
Other than the missing RMSE values, there are 50 or more warnings; a sample is as follows:
50: model fit failed for Fold07.Rep2: shrinkage=0.1, interaction.depth=2, n.minobsinnode=10, n.trees=150 Error in (function (x, y, offset = NULL, misc = NULL, distribution = "bernoulli", :
unused argument (tunegrid = list(1000, 1, 0.05, 3))
Though my target variable is numeric (regression), I notice that the distribution indicated in the warning shows bernoulli, I've thus specified the distribution in my model to be gaussian but R still returns the same error.
Appreciate your help please, thanks.
