I'm trying to run a multiple regression with 3 independent variables and 3 dependent variables. The question is how water quality influences plankton abundance within and between 3 different locations (guzzlers). The water quality variables are pH, phosphates, and nitrates; the dependent (response) variables would be the plankton abundance at each of the 3 locations.
Here is my code:
model1 <- lm(cbind(Abundance[Guzzler.. == 1], Abundance[Guzzler.. == 2],
Abundance[Guzzler.. == 3]) ~ Phospates + Nitrates + pH,
data=WQAbundancebyGuzzler)
And this is the error message I am getting:
Error in model.frame.default(formula = cbind(Abundance[Guzzler.. == 1], :
variable lengths differ (found for 'Phospates')
I think it has to do with how my data is set up, but I'm not sure how to change it to get the model to run. What I'm trying to see is how these water quality variables affect the abundance at the different locations and how the effects vary between them. So it doesn't seem quite logical to fit multiple separate models, which was my only other thought.
Here is the output from dput(head(WQAbundancebyGuzzler)):
structure(list(ï..Date = structure(c(2L, 4L, 1L, 3L, 5L, 2L), .Label = c("11/16/2018",
"11/2/2018", "11/30/2018", "11/9/2018", "12/7/2018"), class = "factor"),
Guzzler.. = c(1L, 1L, 1L, 1L, 1L, 2L), Phospates = c(2L,
2L, 2L, 2L, 2L, 1L), Nitrates = c(0, 0.3, 0, 0.15, 0, 0),
pH = c(7.5, 8, 7.5, 7, 7, 8), Air.Temp..C. = c(20.8, 25.4,
20.9, 16.8, 19.4, 27.4), Relative.Humidity... = c(62L, 31L,
41L, 59L, 59L, 43L), DO2.Concentration..mg.L. = c(3.61, 4.48,
3.57, 5.65, 2.45, 5.86), Water.Temp..C. = c(14.1, 11.5, 11.8,
13.9, 11.1, 17.8), Abundance = c(98L, 43L, 65L, 55L, 54L,
29L)), .Names = c("ï..Date", "Guzzler..", "Phospates", "Nitrates",
"pH", "Air.Temp..C.", "Relative.Humidity...", "DO2.Concentration..mg.L.",
"Water.Temp..C.", "Abundance"), row.names = c(NA, 6L), class = "data.frame")
I think the problem here is more theoretical: You say that you have three dependent variables that you want to enter into a multiple linear regression. However, at least in classic linear regression, there can only be one dependent variable. There might be ways around this, but I think in your case, one dependent variable works just fine: It's `Abundance`. Now, you have sampled three different locations. One way to account for this is to enter the location as a categorical independent variable. So I would propose the following model:
# Make sure that Guzzler is not treated as numeric
WQAbundancebyGuzzler$Guzzler <- as.factor(WQAbundancebyGuzzler$Guzzler)
# Model with 4 independent variables
model1 <- lm(Abundance ~ Guzzler + Phospates + Nitrates + pH,
data=WQAbundancebyGuzzler)
It's probably also wise to think about possible interactions here, especially between Guzzler and the other independent variables.
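To make the interaction idea concrete, here is a minimal sketch. It assumes simulated stand-in data (the data frame `wq` and all of its values are made up for illustration; only the column names follow the question), and compares the additive model with one that lets every water-quality slope differ by guzzler.

```r
# Hypothetical stand-in data: 3 guzzlers x 12 sampling dates (all values simulated)
set.seed(1)
wq <- data.frame(
  Guzzler   = factor(rep(1:3, each = 12)),
  Phospates = runif(36, 0, 3),      # column name spelled as in the question's data
  Nitrates  = runif(36, 0, 0.5),
  pH        = runif(36, 6.5, 8.5)
)
wq$Abundance <- round(40 + 10 * wq$Phospates + 5 * as.integer(wq$Guzzler) +
                        rnorm(36, sd = 8))

# Additive model: one common slope per water-quality variable
m_add <- lm(Abundance ~ Guzzler + Phospates + Nitrates + pH, data = wq)

# Interaction model: each guzzler gets its own slope for every variable
m_int <- lm(Abundance ~ Guzzler * (Phospates + Nitrates + pH), data = wq)

# F-test of whether the guzzler-specific slopes improve the fit
anova(m_add, m_int)
```

The interaction model needs many more parameters (12 instead of 6 here), so with only a handful of dates per guzzler it may be too much to estimate reliably.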
The reason for your error is that you subset only "Abundance" but not the other variables, so their lengths differ. You need to subset the whole data frame, e.g.
lm(Abundance ~ Phospates + Nitrates + pH,
data=WQAbundancebyGuzzler[WQAbundancebyGuzzler$Abundance %in% c(1, 2, 3), ])
With given head(WQAbundancebyGuzzler)
lm(Abundance ~ Phospates + Nitrates + pH,
data=WQAbundancebyGuzzler[WQAbundancebyGuzzler$Abundance %in% c(29, 43, 65), ])
results in
# Call:
# lm(formula = Abundance ~ Phospates + Nitrates + pH, data = WQAbundancebyGuzzler
# [WQAbundancebyGuzzler$Abundance %in%
# c(29, 43, 65), ])
#
# Coefficients:
# (Intercept) Phospates Nitrates pH
# -7.00 36.00 -73.33 NA
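A small sketch of why the lengths differ and of subsetting whole rows instead. The data frame `wq` below is simulated and purely hypothetical (the grouping column is called `Guzzler` here rather than `Guzzler..`); it fits one model per location with `split()` and `lapply()`, which is one possible way to do it, not the only one.

```r
# Hypothetical stand-in data: 3 guzzlers x 10 dates (all values simulated)
set.seed(2)
wq <- data.frame(
  Guzzler   = rep(1:3, each = 10),
  Phospates = runif(30, 0, 3),
  Nitrates  = runif(30, 0, 0.5),
  pH        = runif(30, 6.5, 8.5)
)
wq$Abundance <- round(50 + 8 * wq$Phospates + rnorm(30, sd = 6))

# The original call mixed vectors of different lengths:
length(wq$Abundance[wq$Guzzler == 1])  # rows of one guzzler only
nrow(wq)                               # rows of the full data frame

# Subsetting whole rows keeps response and predictors aligned
by_guzzler <- split(wq, wq$Guzzler)
fits <- lapply(by_guzzler, function(d) {
  lm(Abundance ~ Phospates + Nitrates + pH, data = d)
})

# One column of coefficients per guzzler
sapply(fits, coef)
```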
I have a mixed model with an interaction of two continuous variables. I understand how to use predict() for a continuous by categorical interaction, but can't find any information on how to use predict() to generate graphs of continuous by continuous interactions. So far I have:
#the data
mydata<-structure(list(Week = c(3L, 3L, 6L, 6L, 5L, 6L, 3L, 1L, 4L, 3L,
1L, 2L, 6L, 6L, 6L, 6L), X2 = c(20.8, 21.4, 22.2, 21.9, 21, 21.8,
16.6, 15.6, 21.9, 19.8, 17.5, 12.5, 20.1, 20.5, 21.7, 22.3),
X1 = c(78L, 90L, 81L, 44L, 9L, 35L, 99L, 17L, 1L, 7L, 23L,
14L, 9L, 77L, 84L, 1L), Y = c(14.97469781, 19.88267242, 15.59780954,
9.633809968, 15.12038794, 10.43636012, 10.7436911, 16.71840387,
12.43274774, 10.90741585, 8.79514591, 14.1932374, 8.776376951,
9.995133069, 12.38314719, 9.611533444)), class = "data.frame", row.names = c(NA,
-16L))
#assigning 'Week' as a factor
mydata$Week<-as.factor(mydata$Week)
#the model (glmer() comes from the lme4 package)
library(lme4)
model1<-glmer(Y~X1*X2+(1|Week),data=mydata, family=Gamma(link='log'))
NEWDATA <-
expand.grid(
X1 = seq(1, 99, length = 100),
X2 = seq(12.5, 22.3, length = 100),
Week = levels(mydata$Week)
)
PREDMASS <-
predict(model1,
newdata = NEWDATA,
re.form = ~ (1 | Week))
PREDSFRAME <- cbind(NEWDATA, PREDMASS)
head(PREDSFRAME)
If the interaction were between a continuous and a categorical variable, I would then use the code below, but this doesn't work:
ggplot(PREDSFRAME, aes(x = X1, y = PREDMASS)) +
geom_line() +
geom_point(data = mydata,
facet_grid(. ~ X2) +
aes(y = Y),
alpha = 0.3)
Any suggestions?
I think you actually want the facet_grid outside the geom_point() function. If you run it that way you won't get an error.
ggplot(PREDSFRAME, aes(x = X1, y = PREDMASS)) +
geom_line() +
geom_point(data = mydata,
aes(y = Y),
alpha = 0.3)+
facet_grid(. ~ X2)
But what you get is a grid of plots for every value of X2 (because it is continuous), which is also not what you want.
What you need to do is specify a few X2 values for which you want separate plots (or regression lines), since I am assuming (perhaps incorrectly) that you don't want to plot every possible combination (which is a 2D plane, as @HongOoi suggests).
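A rough sketch of that idea with predict(), under some loud assumptions: a plain Gamma `glm()` (called `stand_in` here) replaces the glmer fit so the example only needs base R, the Y values are rounded copies of the question's data, and the three X2 values are arbitrarily taken as the 10%, 50% and 90% quantiles. With the actual glmer model, the same grid could presumably be passed to `predict(model1, newdata = grid, re.form = NA, type = "response")` for population-level predictions.

```r
# Data from the question (Y rounded; Week dropped because the stand-in has no random effect)
mydata <- data.frame(
  X1 = c(78, 90, 81, 44, 9, 35, 99, 17, 1, 7, 23, 14, 9, 77, 84, 1),
  X2 = c(20.8, 21.4, 22.2, 21.9, 21, 21.8, 16.6, 15.6, 21.9, 19.8,
         17.5, 12.5, 20.1, 20.5, 21.7, 22.3),
  Y  = c(14.97, 19.88, 15.60, 9.63, 15.12, 10.44, 10.74, 16.72,
         12.43, 10.91, 8.80, 14.19, 8.78, 10.00, 12.38, 9.61)
)

# Stand-in model (assumption): fixed effects only, same family and link
stand_in <- glm(Y ~ X1 * X2, data = mydata, family = Gamma(link = "log"))

# Pick a few X2 values instead of a full 100 x 100 grid
x2_values <- unname(quantile(mydata$X2, probs = c(0.1, 0.5, 0.9)))
grid <- expand.grid(X1 = seq(1, 99, length.out = 50), X2 = x2_values)
grid$pred <- predict(stand_in, newdata = grid, type = "response")

# Observed points plus one predicted line per chosen X2 value
plot(mydata$X1, mydata$Y, pch = 19, col = "grey60", xlab = "X1", ylab = "Y",
     ylim = range(c(mydata$Y, grid$pred)))
for (i in seq_along(x2_values)) {
  sel <- grid$X2 == x2_values[i]
  lines(grid$X1[sel], grid$pred[sel], lty = i)
}
legend("topright", legend = round(x2_values, 1), lty = seq_along(x2_values),
       title = "X2")
```

The `grid` data frame could equally be handed to ggplot with X2 mapped to colour or to facets, since it now holds only three distinct X2 values.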
I know you ask for a solution using predict, which perhaps you can solve with the above information, but I offer this "pre-made" solution from sjPlot that I find quick and helpful:
library(sjPlot)
plot_model(model1, type="pred",terms=c("X1","X2"))
I'm trying to run a GAM on proportional data (numeric between 0 and 1). But I'm getting the warning
In eval(family$initialize) : non-integer #successes in a binomial glm!
Basically I'm modelling the number of occurrences of warm adapted species vs total occurrences of warm and cold adapted species against sea surface temperature and using data from another weather system (NAO) as a random effect, and three other categorical, parametric, variables.
m5 <- gam(prop ~ s(SST_mean) + s(NAO, bs="re") + WarmCold + Cycle6 + Region,
family=binomial, data=DAT_WC, method = "REML")
prop = proportion of occurrences, WarmCold = whether species is warm adapted or cold adapted, Cycle6 = 6 year time period, Region = one of 4 regions. A sample of my dataset is below
structure(list(WarmCold = structure(c(1L, 1L, 1L, 1L, 2L, 2L), .Label = c("Cold",
"Warm"), class = "factor"), Season = structure(c(2L, 2L, 2L,
2L, 2L, 2L), .Label = c("Autumn", "Spring", "Summer", "Winter"
), class = "factor"), Region = structure(c(1L, 2L, 3L, 4L, 1L,
2L), .Label = c("OSPARII_N", "OSPARII_S", "OSPARIII_N", "OSPARIII_S"
), class = "factor"), Cycle6 = structure(c(1L, 1L, 1L, 1L, 1L,
1L), .Label = c("1990-1995", "1996-2001", "2002-2007", "2008-2013",
"2014-2019"), class = "factor"), WC.Strandings = c(18L, 10L,
0L, 3L, 5L, 25L), SST_mean = c(7.4066298185553, 7.49153086390094,
9.28247524767124, 10.8654859624361, 7.4066298185553, 7.49153086390094
), NAO = c(0.542222222222222, 0.542222222222222, 0.542222222222222,
0.542222222222222, 0.542222222222222, 0.542222222222222), AMO = c(-0.119444444444444,
-0.119444444444444, -0.119444444444444, -0.119444444444444, -0.119444444444444,
-0.119444444444444), Total.Strandings = c(23, 35, 5, 49, 23,
35), prop = c(0.782608695652174, 0.285714285714286, 0, 0.0612244897959184,
0.217391304347826, 0.714285714285714)), row.names = c(NA, 6L), class = "data.frame")
From the literature (Zuur, 2009) it seems that a binomial distribution is the best used for proportional data. But it doesn't seem to be working. It's running but giving the above warning, and outputs that don't make sense. What am I doing wrong here?
This is a warning, not an error, but it does indicate that something is not quite right: the binomial distribution has support on the non-negative integers, so it doesn't make sense to pass in non-integer values without the sample totals from which the proportions were formed.
You can do this using the weights argument, which in this case should take a vector of integers containing the count total for each observation from which the proportion was computed.
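A minimal sketch of the weights approach, with assumptions stated up front: the data are simulated, the column names `warm` and `total` are hypothetical, and base `glm()` is used as a stand-in so the example runs without extra packages. mgcv's `gam()` takes `family` and `weights` arguments in the same way.

```r
# Hypothetical data: counts of warm-adapted strandings out of a known total
set.seed(3)
n <- 40
d <- data.frame(
  SST   = runif(n, 7, 12),
  total = sample(5:50, n, replace = TRUE)
)
d$warm <- rbinom(n, size = d$total, prob = plogis(-4 + 0.4 * d$SST))
d$prop <- d$warm / d$total

# Proportion as response, totals supplied through `weights`
m_w <- glm(prop ~ SST, family = binomial, weights = total, data = d)

# Equivalent two-column form: (successes, failures)
m_c <- glm(cbind(warm, total - warm) ~ SST, family = binomial, data = d)

# Both parameterisations describe the same likelihood
all.equal(coef(m_w), coef(m_c))
```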
Alternatively, consider using family = quasibinomial if the mean-variance relationship is right for your data; the warning will go away, but then you won't be able to use AIC and related tools that expect a real likelihood.
If your proportions are true proportions, then consider family = betar to fit a beta regression model, where the conditional distribution of the response has support on real values in the unit interval (0, 1) (but technically not 0 or 1; mgcv will add or subtract a small number to adjust the data if there are 0 or 1 values in the response).
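I also found that, rather than calculating a proportion, passing the two counts to cbind() removed the warning. Note that the binomial family reads a two-column response as (successes, failures), so the second column should be the total minus the successes, e.g.
m8 <- gam(cbind(WC.Strandings, Total.Strandings - WC.Strandings) ~ s(x1) + x2,
family=binomial(link="logit"), data=DAT, method = "REML")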
I am struggling to examine significant interactions from a linear regression with user-defined contrast codes. I conducted an intervention to improve grades with three treatment groups (i.e., mindset, value, mindset plus value) and one control group, and would now like to see whether there is an interaction between specific intervention levels and various theoretically relevant categorical variables, such as gender, free lunch status, and a binary indicator for having a below-average GPA in the two previous semesters, and relevant continuous variables such as pre-intervention beliefs (Imp_pre.C and Val_pre.C in the sample data frame below). The continuous moderators have been mean centered.
#subset of dataframe
mydata <- structure(list(cond = structure(c(1L, 2L, 3L, 3L, 2L, 3L, 1L,2L, 2L, 1L), contrasts = structure(c(-0.25, 0.0833333333333333,0.0833333333333333, 0.0833333333333333, -1.85037170770859e-17,-0.166666666666667, -0.166666666666667, 0.333333333333333, 0,-0.5, 0.5, 0), .Dim = 4:3, .Dimnames = list(c("control", "mindset","value", "MindsetValue"), NULL)), .Label = c("control", "mindset","value", "MindsetValue"), class = "factor"), sis_mp3_gpa = c(89.0557142857142,91.7514285714285, 94.8975, 87.05875, 69.9928571428571, 78.0357142857142,87.7328571428571, 83.8925, 61.2271428571428, 79.8314285714285), sis_female = c(1L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L), low_gpa_m1m2 = c(0,0, 0, 0, 1, 1, 0, 0, 1, 0), sis_frpl = c(0L, 1L, 0L, 0L, 1L,1L, 1L, 1L, NA, 0L), Imp_pre.C = c(1.80112944983819, -0.198870550161812,1.46112944983819, -0.198870550161812, 0.131129449838188, NA,-0.538870550161812, 0.131129449838188, -0.198870550161812, -0.198870550161812), Val_pre.C = c(-2.2458357581636, -2.0458357581636, 0.554164241836405,0.554164241836405, -0.245835758163595, NA, 0.554164241836405,0.554164241836405, -2.0458357581636, -1.64583575816359)), row.names = c(323L,2141L, 2659L, 2532L, 408L, 179L, 747L, 2030L, 2183L, 733L), class = "data.frame")
Since I'm only interested in specific contrasts, I created the following user defined contrast codes.
mat = matrix(c(1/4, 1/4, 1/4, 1/4, -3, 1, 1, 1, 0, -1, -1, 2, 0,
mymat = solve(t(mat))
mymat
my.contrasts <- mymat[,2:4]
contrasts(mydata$cond) = my.contrasts
I'm testing the following model:
#model
model1 <- lm(sis_mp3_gpa ~ cond +
sis_female*cond +
low_gpa_m1m2*cond +
sis_frpl*cond +
Imp_pre.C*cond +
Val_pre.C*cond +
Imp_pre.C + Val_pre.C +
low_gpa_m1m2 + sis_female +
sis_frpl , data = mydata)
summary(model1)
In the full data set there is a significant interaction between contrast two (comparing mindset plus value to mindset & value) and previous value beliefs (i.e., Val_pre.C) and between this contrast code and having a low gpa the 2 previous semesters. To interpret significant interactions I would like to generate predicted values for individuals one standard deviation below and above the mean on the continuous moderator and the two levels of the categorical moderator. My problem is that when I have tried to do this the plots contain predicted values for each condition rather than collapsing the mindset and value condition and excluding the control condition. Also, when I try to do simple slope analyses I can only get pairwise comparisons for all four conditions rather than the contrasts I'm interested in.
How can I plot predicted values and conduct simple slope analyses for my contrast codes and a categorical and continuous moderator?
This question cannot be answered for two reasons: (1) the code for mat is incomplete and won't run, and (2) the model seriously over-fits the subsetted data provided, resulting in 0 d.f. for error and no way of testing anything.
However, here is an analysis of a much simpler model that may show how to do what is needed.
> simp = lm(sis_mp3_gpa ~ cond * Imp_pre.C, data = mydata)
> library(emmeans)
> emt = emtrends(simp, "cond", var = "Imp_pre.C")
> emt # estimated slopes
cond Imp_pre.C.trend SE df lower.CL upper.CL
control 1.97 7.91 3 -23.2 27.1
mindset 1.37 42.85 3 -135.0 137.7
value 4.72 12.05 3 -33.6 43.1
Confidence level used: 0.95
> # custom contrasts of those slopes
> contrast(emt, list(c1 = c(-1,1,0), c2 = c(-1,-1,2)))
contrast estimate SE df t.ratio p.value
c1 -0.592 43.6 3 -0.014 0.9900
c2 6.104 49.8 3 0.123 0.9102
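For the plotting half of the question, a rough base-R sketch of predicted values at one standard deviation below and above the mean of a centered moderator. Everything here is an assumption for illustration: the data are simulated, `gpa` is a hypothetical outcome name, and the model is a simplified `cond * Val_pre.C` fit. It collapses the mindset and value conditions by averaging their predictions and leaves the control condition out. It gives predicted values only, not tests; the emtrends/contrast calls above remain the route to inference.

```r
# Simulated stand-in data with the four conditions from the question
set.seed(4)
n <- 120
cond_levels <- c("control", "mindset", "value", "MindsetValue")
d <- data.frame(
  cond      = factor(sample(cond_levels, n, replace = TRUE), levels = cond_levels),
  Val_pre.C = rnorm(n)
)
d$Val_pre.C <- d$Val_pre.C - mean(d$Val_pre.C)   # mean-center the moderator
d$gpa <- 80 + 2 * d$Val_pre.C +
  3 * (d$cond == "MindsetValue") * d$Val_pre.C + rnorm(n, sd = 5)

fit <- lm(gpa ~ cond * Val_pre.C, data = d)

# Predictions for every condition at -1 SD and +1 SD of the moderator
s <- sd(d$Val_pre.C)
grid <- expand.grid(cond = cond_levels, Val_pre.C = c(-s, s))
grid$pred <- predict(fit, newdata = grid)

# Collapse mindset and value; keep MindsetValue; drop control
collapsed <- sapply(split(grid, grid$Val_pre.C), function(g) {
  p <- setNames(g$pred, as.character(g$cond))
  c(single = mean(p[c("mindset", "value")]), combined = p[["MindsetValue"]])
})
colnames(collapsed) <- c("minus1SD", "plus1SD")
collapsed
```

The resulting 2 x 2 matrix can be plotted directly, for example with `matplot()` or after reshaping for ggplot.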
So I have the following dataset -
dat <- structure(list(cases = c(2L, 6L, 10L, 8L, 12L, 9L, 28L, 28L,
36L, 32L, 46L, 47L), qrt = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L,
1L, 2L, 3L, 4L), date = c(83, 83.25, 83.5, 83.75, 84, 84.25,
84.5, 84.75, 85, 85.25, 85.5, 85.75)), .Names = c("cases", "qrt",
"date"), class = "data.frame", row.names = c(NA, -12L))
cases qrt date
2 1 83.00
6 2 83.25
10 3 83.50
8 4 83.75
12 1 84.00
9 2 84.25
28 3 84.50
28 4 84.75
36 1 85.00
32 2 85.25
46 3 85.50
47 4 85.75
There are more data points, but to make things look a bit simpler I omitted them.
And to this dataset I have fit a GLM:
fit <- glm(cases~date+qrt, family = poisson, data = dat)
Basically, I would like to create a plot of the fitted values that this GLM produces that looks like this (this is actually the plot for the full data set; the black circles are the original data and the empty circles are the fitted data)
with the repeating x-values qrt on the x-axis. I'm assuming I'd have to use the predict() function and then plot the resulting values, but I've tried this and I get x-values on the x-axis going from 1 to 12 instead of repeating 1,2,3,4,1,2,3,4 etc. Also, how would you plot the original data over the fitted values, as in the plot above?
It is not difficult. Just use axis to control axis display:
## disable "x-axis" when `plot` fitted values
## remember to set decent `ylim` for your plot
plot(fit$fitted, xaxt = "n", xlab = "qtr", ylab = "cases", main = "GLM",
ylim = range(c(fit$fitted, dat$cases)) )
## manually add "x-axis", with "labels" and "las"
axis(1, at = 1:12, labels = rep.int(1:4, 3), las = 2)
## add original observed cases
points(dat$cases, pch = 19)
You don't need to use predict here. You have no gaps / missing values in your quarterly time series, so the fitted values stored in the fitted model fit are all you need.
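A quick check of that claim, using the data from the question: on the rows used for fitting, the stored fitted values agree with `predict(..., type = "response")`. The `new_quarters` data frame is a hypothetical extension (the four quarters after the data end) to show where predict with `newdata` would actually be needed.

```r
# Data from the question
dat <- data.frame(
  cases = c(2, 6, 10, 8, 12, 9, 28, 28, 36, 32, 46, 47),
  qrt   = rep(1:4, 3),
  date  = seq(83, 85.75, by = 0.25)
)
fit <- glm(cases ~ date + qrt, family = poisson, data = dat)

# Fitted values and response-scale predictions coincide on the original rows
same <- all.equal(unname(fitted(fit)),
                  unname(predict(fit, type = "response")))
same

# Hypothetical: quarters outside the data need predict() with newdata
new_quarters <- data.frame(date = seq(86, 86.75, by = 0.25), qrt = 1:4)
new_quarters$pred <- predict(fit, newdata = new_quarters, type = "response")
new_quarters
```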
with ggplot:
df <- rbind(data.frame(index=as.factor(1:nrow(dat)), value=dat$cases, cases='actual'),
data.frame(index=as.factor(1:nrow(dat)), value=predict(fit, type='response'), cases='predicted'))
library(ggplot2)
ggplot(df, aes(index, value, color=cases)) + geom_point(cex=3) +
scale_color_manual(values=c('black', 'gray')) +
scale_y_continuous(breaks=seq(0, max(df$value)+5, 5)) + theme_bw()
I have a data frame which has two types of 'groups,' the densities of which I would like to overlay on the same graph.
Using ggplot, I tried to graph the density with the following two lines of code:
full$group <- factor(full$group)
ggplot(full, aes(x=income, fill=group)) + geom_density()
The issue with this is that it does not take the frequency variable (freq) into account, and simply calculates the frequency itself. That is an issue because there is exactly one row for every income-group combination.
I believe I have two options, each of which has a question:
a) Should I plot the graph using the way the data is currently formatted? If so, how would I do that?
b) Should I reformat the data to make the frequency of each group/income combination equivalent to the freq variable assigned to it? If so, how would I do that?
This is the kind of graph I would like, where "income" = "rating" and "group" = "cond":
dput of 'full':
full <- structure(list(income = c(10000, 19000, 29000, 39000, 49000, 75000, 99000, 1e+05, 10000, 19000,29000, 39000, 49000, 75000, 99000, 1e+05),
group = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("one", "two"), class = "factor"),
freq = c(1237, 1791, 743, 291, 256, 212, 29, 11, 921, 1512, 614, 301, 209, 223, 48, 1)), .Names = c("income", "group", "freq"),
row.names = c(NA, 16L), class = "data.frame")
You can repeat the observations by their frequency with
ggplot(full[rep(1:nrow(full), full$freq),]) +
geom_density(aes(x=income, fill=group), color="black", alpha=.75, adjust=4)
Of course with your data this produces a pretty lousy plot
When estimating a density, your data should be observations from a continuous distribution. Here you really have a discrete distribution with repeated observations (in a true continuous distribution, the probability of seeing any value more than once is 0).
You could try to smooth this curve by setting the adjust= parameter to a number greater than 1 (like 3 or 4). But really, your input data is just not in an appropriate form for a density plot. A bar plot would be a better choice. Maybe something like
ggplot(full, aes(as.factor(income), freq, fill=group)) +
geom_bar(stat="identity", position="dodge")
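If a density curve is still wanted without repeating rows (option a in the question), one possibility is a weighted kernel density. The sketch below uses base R's `density()` with its `weights` argument on the question's data; the bandwidth of 8000 is an arbitrary assumption chosen only for illustration, and the weights are normalised within each group so they sum to 1. geom_density() in ggplot2 also understands a weight aesthetic, so the same normalised weights could be mapped there.

```r
# Data from the question
full <- data.frame(
  income = rep(c(10000, 19000, 29000, 39000, 49000, 75000, 99000, 1e5), 2),
  group  = factor(rep(c("one", "two"), each = 8)),
  freq   = c(1237, 1791, 743, 291, 256, 212, 29, 11,
             921, 1512, 614, 301, 209, 223, 48, 1)
)

# One weighted density per group; weights sum to 1 within each group
dens <- lapply(split(full, full$group), function(g) {
  density(g$income, weights = g$freq / sum(g$freq), bw = 8000)  # bw is an assumption
})

# Overlay the two curves
plot(dens$one, main = "Income density weighted by freq", xlab = "income",
     ylim = c(0, max(dens$one$y, dens$two$y)))
lines(dens$two, lty = 2)
legend("topright", legend = names(dens), lty = 1:2)
```

The caveat above still applies: with only eight distinct income values per group, the shape of the curve is driven largely by the chosen bandwidth.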