I am working with physical activity data and follow-up pain data. I have a large dataset, but for the sake of this example I have created a small one with the variables of interest.
As my physical activity data are compositional in nature, I am using compositional data analysis before using those variables as predictors in my mixed-effects model. My goal is to use the predict() function on some new data that I have created, but I am receiving the following:
Error in rep(0, nobs) : invalid 'times' argument
I have googled it and found a similar post from a few years ago, but the answer there did not work for me.
Below is the dataset and my code:
library("tidyverse")
library("compositions")
library("robCompositions")
library("lme4")
dataset <- structure(list(work = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L,
3L, 3L, 4L, 4L, 4L), .Label = c("1", "2", "3", "4"), class = "factor"),
department = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L,
3L, 4L, 4L, 4L), .Label = c("1", "2", "3", "4"), class = "factor"),
worker = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L,
4L, 4L, 4L), .Label = c("1", "2", "3", "4"), class = "factor"),
age = c(45, 43, 65, 45, 76, 34, 65, 23, 23, 45, 32, 76),
sex = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L,
2L, 2L), .Label = c("1", "2"), class = "factor"), pain = c(4,
5, 3, 2, 0, 7, 8, 10, 1, 4, 5, 4), lpa_w = c(45, 65, 43,
76, 98, 65, 34, 56, 2, 3, 12, 34), mvpa_w = c(12, 54, 76,
87, 45, 23, 65, 23, 54, 76, 23, 54), lpa_l = c(54, 65, 34,
665, 76, 87, 12, 34, 54, 12, 45, 12), mvpa_l = c(12, 43,
56, 87, 12, 54, 76, 87, 98, 34, 56, 23)), class = "data.frame", row.names = c(NA,
-12L))
#create compositions of physical activity
dataset$comp_w <- acomp(cbind(lpa_w = dataset[,7],
                              mvpa_w = dataset[,8]))
dataset$comp_l <- acomp(cbind(lpa_l = dataset[,9],
                              mvpa_l = dataset[,10]))
#Make a grid to use for predictions for composition of lpa_w and mvpa_w
mygrid = rbind(
  expand.grid(lpa_w = seq(min(2), max(98), 5),
              mvpa_w = seq(min(12), max(87), 5)))
griddata <- acomp(mygrid)
#run the model
model <- lmer(pain ~ ilr(comp_w) + age + sex + ilr(comp_l) +
                (1 | work / department / worker),
              data = dataset)
(prediction = predict(model, newdata = list(comp_w = griddata,
    age = rep(mean(dataset$age, na.rm = TRUE), nrow(griddata)),
    sex = rep("1", nrow(griddata)),
    comp_l = do.call("rbind", replicate(n = nrow(griddata), mean(acomp(dataset[,12])), simplify = FALSE)),
    work = rep(dataset$work, nrow(griddata)),
    department = rep(dataset$department, nrow(griddata)),
    worker = rep(dataset$worker, nrow(griddata)))))
Any help would be greatly appreciated.
Thanks
Assigning the results of acomp to an element of a data frame gives a weird data structure that messes things up downstream.
Constructing this data set (without messing up the original dataset):
dataset_weird <- dataset
dataset_weird$comp_w <- acomp(cbind(lpa_w = dataset[,7],
                                    mvpa_w = dataset[,8]))
dataset_weird$comp_l <- acomp(cbind(lpa_l = dataset[,9],
                                    mvpa_l = dataset[,10]))
The result is so weird that str(dataset_weird), the usual way of investigating the structure of an R object, fails with
$ comp_w :Error in unclass(x)[i, , drop = drop] :
(subscript) logical subscript too long
If we run sapply(dataset_weird, class) we see that these elements have class acomp. (They also appear to have an odd print() method: when we print(dataset_weird$comp_w) the results are a matrix of strings, but if we unclass(dataset_weird$comp_w) we can see that the underlying object is numeric [!])
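That is:
sapply(dataset_weird, class)          ## comp_w and comp_l have class "acomp"
print(dataset_weird$comp_w)           ## prints as a matrix of strings
head(unclass(dataset_weird$comp_w))   ## the underlying values are numeric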
This whole problem is kind of tricky since you're working with n-column matrices that are getting converted to special acomp() objects that are then getting converted to (n-1)-dimensional matrices (isometric log-ratio-transformed compositional data), the columns of which are then getting used as predictors. The basic point is that lme4's machinery will get confused if you have elements in your data frame that are not simple one-dimensional vectors. So you have to do the work of creating data frame columns yourself.
Here's what I came up with, with one missing piece (described below):
## utility function: *either* uses a matrix argument (`comp_data`)
## *or* extracts relevant columns from a data frame (`data`):
## returns ilr-transformed values as a matrix, with appropriate column names
ilr_dat <- function(data, suffix = NULL, comp_data = NULL) {
    if (!is.null(suffix) && is.null(comp_data)) {
        comp_data <- as.matrix(data[grep(paste0(suffix, "$"), names(data))])
    }
    ilrmat <- ilr(acomp(comp_data))
    colnames(ilrmat) <- paste0("ilr", suffix, ".", 1:ncol(ilrmat))
    return(ilrmat)
}
## augment original data set (without weird compositional elements)
## using data.frame() rather than $<- or rbind() collapses matrix arguments
## to data frame rows in a way that R expects
dataset2 <- data.frame(dataset, ilr_dat(dataset, "_l"))
dataset2 <- data.frame(dataset2, ilr_dat(dataset, "_w"))
mygrid <- rbind(
    expand.grid(lpa_w = seq(min(2), max(98), 5),
                mvpa_w = seq(min(12), max(87), 5)))
## generate ilr data for prediction
griddata <- as.data.frame(ilr_dat(comp_data=mygrid, suffix="_w"))
#run the model: ilr(comp_l) **not** included, see below
model <- lmer(pain ~ ilr_w.1 + age + sex + ## ilr(comp_l) +
                (1 | work / department / worker),
              data = dataset2)
## utility function for replication
xfun <- function(s) rep(dataset[[s]], nrow(griddata))
predict(model, newdata = data.frame(griddata,
                                    age = mean(dataset$age, na.rm = TRUE),
                                    sex = "1",
                                    work = xfun("work"),
                                    department = xfun("department"),
                                    worker = xfun("worker")))
This seems to work.
The reason I did not include the _l composition/ilr in the model or the predictions is that I couldn't understand what this statement was doing:
comp_l = do.call("rbind", replicate(n=nrow(griddata), mean(acomp(dataset[,12])), simplify = FALSE))
I'm trying to run a GAM on proportional data (numeric between 0 and 1). But I'm getting the warning
In eval(family$initialize) : non-integer #successes in a binomial glm!
Basically I'm modelling the number of occurrences of warm-adapted species vs. total occurrences of warm- and cold-adapted species against sea surface temperature, using data from another weather system (NAO) as a random effect, plus three other categorical, parametric variables.
m5 <- gam(prop ~ s(SST_mean) + s(NAO, bs = "re") + WarmCold + Cycle6 + Region,
          family = binomial, data = DAT_WC, method = "REML")
prop = proportion of occurrences, WarmCold = whether the species is warm or cold adapted, Cycle6 = 6-year time period, Region = one of 4 regions. A sample of my dataset is below:
structure(list(WarmCold = structure(c(1L, 1L, 1L, 1L, 2L, 2L), .Label = c("Cold",
"Warm"), class = "factor"), Season = structure(c(2L, 2L, 2L,
2L, 2L, 2L), .Label = c("Autumn", "Spring", "Summer", "Winter"
), class = "factor"), Region = structure(c(1L, 2L, 3L, 4L, 1L,
2L), .Label = c("OSPARII_N", "OSPARII_S", "OSPARIII_N", "OSPARIII_S"
), class = "factor"), Cycle6 = structure(c(1L, 1L, 1L, 1L, 1L,
1L), .Label = c("1990-1995", "1996-2001", "2002-2007", "2008-2013",
"2014-2019"), class = "factor"), WC.Strandings = c(18L, 10L,
0L, 3L, 5L, 25L), SST_mean = c(7.4066298185553, 7.49153086390094,
9.28247524767124, 10.8654859624361, 7.4066298185553, 7.49153086390094
), NAO = c(0.542222222222222, 0.542222222222222, 0.542222222222222,
0.542222222222222, 0.542222222222222, 0.542222222222222), AMO = c(-0.119444444444444,
-0.119444444444444, -0.119444444444444, -0.119444444444444, -0.119444444444444,
-0.119444444444444), Total.Strandings = c(23, 35, 5, 49, 23,
35), prop = c(0.782608695652174, 0.285714285714286, 0, 0.0612244897959184,
0.217391304347826, 0.714285714285714)), row.names = c(NA, 6L), class = "data.frame")
From the literature (Zuur, 2009), it seems that a binomial distribution is best suited for proportional data. But it doesn't seem to be working: the model runs, but gives the above warning and outputs that don't make sense. What am I doing wrong here?
This is a warning, not an error, but it does indicate that something is not quite right: the binomial distribution has support on the non-negative integers, so it doesn't make sense to pass in non-integer values without the sample totals from which the proportions were formed.
You can supply those totals using the weights argument, which in this case should take a vector of integers containing the count total for each observation from which the proportion was computed.
Alternatively, consider using family = quasibinomial if the mean-variance relationship is right for your data; the warning will go away, but you'll no longer be able to use AIC and related tools that expect a real likelihood.
If your proportions are true proportions, then consider family = betar to fit a beta regression model, where the conditional distribution of the response has support on real values in the unit interval (0, 1) (though technically not 0 or 1 themselves; mgcv will add or subtract a small number to adjust the data if there are 0 or 1 values in the response).
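For concreteness, both options might look like this (a sketch only, assuming the DAT_WC data from the question and taking Total.Strandings as the binomial denominator):
library(mgcv)
## option 1: binomial with the totals supplied via `weights`
m5w <- gam(prop ~ s(SST_mean) + s(NAO, bs = "re") + WarmCold + Cycle6 + Region,
           family = binomial, weights = Total.Strandings,
           data = DAT_WC, method = "REML")
## option 2: beta regression on the proportions themselves
m5b <- gam(prop ~ s(SST_mean) + s(NAO, bs = "re") + WarmCold + Cycle6 + Region,
           family = betar(link = "logit"), data = DAT_WC, method = "REML")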
I also found that, rather than supplying a total via weights, using cbind() with the two relevant columns removed the warning. Note that for a binomial model the two columns must be successes and failures, so the total has to be converted, e.g.
m8 <- gam(cbind(WC.Strandings, Total.Strandings - WC.Strandings) ~ s(x1) + x2,
          family = binomial(link = "logit"), data = DAT, method = "REML")
I am struggling to examine significant interactions from a linear regression with user-defined contrast codes. I conducted an intervention to improve grades, with three treatment groups (i.e., mindset, value, mindset plus value) and one control group, and would now like to see if there is an interaction between specific intervention levels and various theoretically relevant categorical variables, such as gender, free lunch status, and a binary indicator for having a below-average GPA in the two previous semesters, as well as relevant continuous variables such as pre-intervention beliefs (Imp_pre.C and Val_pre.C in the sample dataframe below). The continuous moderators have been mean-centered.
#subset of dataframe
mydata <- structure(list(
    cond = structure(c(1L, 2L, 3L, 3L, 2L, 3L, 1L, 2L, 2L, 1L),
        contrasts = structure(c(-0.25, 0.0833333333333333, 0.0833333333333333,
            0.0833333333333333, -1.85037170770859e-17, -0.166666666666667,
            -0.166666666666667, 0.333333333333333, 0, -0.5, 0.5, 0),
            .Dim = 4:3,
            .Dimnames = list(c("control", "mindset", "value", "MindsetValue"), NULL)),
        .Label = c("control", "mindset", "value", "MindsetValue"), class = "factor"),
    sis_mp3_gpa = c(89.0557142857142, 91.7514285714285, 94.8975, 87.05875,
        69.9928571428571, 78.0357142857142, 87.7328571428571, 83.8925,
        61.2271428571428, 79.8314285714285),
    sis_female = c(1L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L),
    low_gpa_m1m2 = c(0, 0, 0, 0, 1, 1, 0, 0, 1, 0),
    sis_frpl = c(0L, 1L, 0L, 0L, 1L, 1L, 1L, 1L, NA, 0L),
    Imp_pre.C = c(1.80112944983819, -0.198870550161812, 1.46112944983819,
        -0.198870550161812, 0.131129449838188, NA, -0.538870550161812,
        0.131129449838188, -0.198870550161812, -0.198870550161812),
    Val_pre.C = c(-2.2458357581636, -2.0458357581636, 0.554164241836405,
        0.554164241836405, -0.245835758163595, NA, 0.554164241836405,
        0.554164241836405, -2.0458357581636, -1.64583575816359)),
    row.names = c(323L, 2141L, 2659L, 2532L, 408L, 179L, 747L, 2030L, 2183L, 733L),
    class = "data.frame")
Since I'm only interested in specific contrasts, I created the following user defined contrast codes.
mat = matrix(c(1/4, 1/4, 1/4, 1/4, -3, 1, 1, 1, 0, -1, -1, 2, 0,
mymat = solve(t(mat))
mymat
my.contrasts <- mymat[,2:4]
contrasts(mydata$cond) = my.contrasts
I'm testing the following model:
#model
model1 <- lm(sis_mp3_gpa ~ cond +
                 sis_female*cond +
                 low_gpa_m1m2*cond +
                 sis_frpl*cond +
                 Imp_pre.C*cond +
                 Val_pre.C*cond +
                 Imp_pre.C + Val_pre.C +
                 low_gpa_m1m2 + sis_female +
                 sis_frpl, data = mydata)
summary(model1)
In the full data set there is a significant interaction between contrast two (comparing mindset plus value to mindset and value) and previous value beliefs (i.e., Val_pre.C), and between this contrast code and having a low GPA in the two previous semesters. To interpret significant interactions I would like to generate predicted values for individuals one standard deviation below and above the mean on the continuous moderator, and for the two levels of the categorical moderator. My problem is that when I have tried to do this, the plots contain predicted values for each condition rather than collapsing the mindset and value conditions and excluding the control condition. Also, when I try to do simple slope analyses I can only get pairwise comparisons for all four conditions rather than the contrasts I'm interested in.
How can I plot predicted values and conduct simple slope analyses for my contrast codes and a categorical and continuous moderator?
This question cannot be answered for two reasons: (1) the code for mat is incomplete and won't run, and (2) the model seriously over-fits the subsetted data provided, resulting in 0 d.f. for error and no way of testing anything.
However, here is an analysis of a much simpler model that may show how to do what is needed.
> simp = lm(sis_mp3_gpa ~ cond * Imp_pre.C, data = mydata)
> library(emmeans)
> emt = emtrends(simp, "cond", var = "Imp_pre.C")
> emt # estimated slopes
 cond    Imp_pre.C.trend    SE df lower.CL upper.CL
 control            1.97  7.91  3    -23.2     27.1
 mindset            1.37 42.85  3   -135.0    137.7
 value              4.72 12.05  3    -33.6     43.1
Confidence level used: 0.95
> # custom contrasts of those slopes
> contrast(emt, list(c1 = c(-1,1,0), c2 = c(-1,-1,2)))
 contrast estimate   SE df t.ratio p.value
 c1         -0.592 43.6  3  -0.014  0.9900
 c2          6.104 49.8  3   0.123  0.9102
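To get predicted values at chosen moderator levels (e.g., one SD below and above the mean of the centered moderator), something like the following sketch could be used, again with the simplified model:
sdv <- sd(mydata$Imp_pre.C, na.rm = TRUE)
emm <- emmeans(simp, ~ cond | Imp_pre.C, at = list(Imp_pre.C = c(-sdv, sdv)))
emm   ## predicted values for each condition at each moderator level
contrast(emm, list(c1 = c(-1, 1, 0), c2 = c(-1, -1, 2)), by = "Imp_pre.C")
The same at = list(...) idea extends to a categorical moderator once it is in the model.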
I'm trying to run a multiple regression with 3 independent variables and 3 dependent variables. The question is based on how water quality influences plankton abundance in and between 3 different locations, aka guzzlers. The water quality variables are pH, phosphates, and nitrates. The dependent/response variables would be the plankton abundance at each of the 3 locations.
Here is my code:
model1 <- lm(cbind(Abundance[Guzzler.. == 1], Abundance[Guzzler.. == 2],
                   Abundance[Guzzler.. == 3]) ~ Phospates + Nitrates + pH,
             data = WQAbundancebyGuzzler)
And this is the error message I am getting:
Error in model.frame.default(formula = cbind(Abundance[Guzzler.. == 1], :
variable lengths differ (found for 'Phospates')
I think it has to do with how my data is set up, but I'm not sure how to go about changing it to get the model to run. What I'm trying to see is how these water quality variables affect abundance in the different locations and how they vary between them. So it doesn't seem quite logical to try multiple models, which was my only other thought.
Here is the output from dput(head(WQAbundancebyGuzzler)):
structure(list(ï..Date = structure(c(2L, 4L, 1L, 3L, 5L, 2L), .Label = c("11/16/2018",
"11/2/2018", "11/30/2018", "11/9/2018", "12/7/2018"), class = "factor"),
Guzzler.. = c(1L, 1L, 1L, 1L, 1L, 2L), Phospates = c(2L,
2L, 2L, 2L, 2L, 1L), Nitrates = c(0, 0.3, 0, 0.15, 0, 0),
pH = c(7.5, 8, 7.5, 7, 7, 8), Air.Temp..C. = c(20.8, 25.4,
20.9, 16.8, 19.4, 27.4), Relative.Humidity... = c(62L, 31L,
41L, 59L, 59L, 43L), DO2.Concentration..mg.L. = c(3.61, 4.48,
3.57, 5.65, 2.45, 5.86), Water.Temp..C. = c(14.1, 11.5, 11.8,
13.9, 11.1, 17.8), Abundance = c(98L, 43L, 65L, 55L, 54L,
29L)), .Names = c("ï..Date", "Guzzler..", "Phospates", "Nitrates",
"pH", "Air.Temp..C.", "Relative.Humidity...", "DO2.Concentration..mg.L.",
"Water.Temp..C.", "Abundance"), row.names = c(NA, 6L), class = "data.frame")
I think the problem here is more theoretical: you say that you have three dependent variables that you want to enter into a multiple linear regression. However, at least in classic linear regression, there can only be one dependent variable. There might be ways around this, but I think in your case one dependent variable works just fine: it's Abundance. Now, you have sampled three different locations. One solution to account for this could be to enter the location as a categorical independent variable. So I would propose the following model:
# Make sure that Guzzler is not treated as numeric
WQAbundancebyGuzzler$Guzzler <- as.factor(WQAbundancebyGuzzler$Guzzler..)
# Model with 4 independent variables
model1 <- lm(Abundance ~ Guzzler + Phospates + Nitrates + pH,
             data = WQAbundancebyGuzzler)
It's probably also wise to think about possible interactions here, especially between Guzzler and the other independent variables.
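For instance (a sketch building on model1 above):
# allow the water-quality effects to differ by guzzler
model2 <- lm(Abundance ~ Guzzler * (Phospates + Nitrates + pH),
             data = WQAbundancebyGuzzler)
anova(model1, model2)   # does allowing interactions improve the fit?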
The reason for your error is that you try to subset only Abundance but not the other variables, so their lengths differ. You need to subset the whole data, e.g.
lm(Abundance ~ Phospates + Nitrates + pH,
   data = WQAbundancebyGuzzler[WQAbundancebyGuzzler$Abundance %in% c(1, 2, 3), ])
With the given head(WQAbundancebyGuzzler),
lm(Abundance ~ Phospates + Nitrates + pH,
   data = WQAbundancebyGuzzler[WQAbundancebyGuzzler$Abundance %in% c(29, 43, 65), ])
results in
# Call:
# lm(formula = Abundance ~ Phospates + Nitrates + pH, data = WQAbundancebyGuzzler
# [WQAbundancebyGuzzler$Abundance %in%
# c(29, 43, 65), ])
#
# Coefficients:
# (Intercept) Phospates Nitrates pH
# -7.00 36.00 -73.33 NA
I'm using the 'metafor' package in R to compute log response ratios. Some of my means are zero, which seems to be the cause of a warning from my escalc command (since log(0) is -Inf). The metafor package provides a method of adding a small value to zero cells to avoid this. The documentation states:
"Cell entries with a zero can be problematic especially for the relative risk and the odds ratio. Adding a small constant to the cells of the 2 × 2 tables is a common solution to this problem [...] When to = "only0", the value of add is added to each cell of the 2 × 2 tables only in those tables with at least one cell equal to 0."
For some reason this is not resolving my warning, perhaps because my data is not a 2x2 table? (It is output from summarise with ddply from the plyr package, similar to the formatting in this example.) Must I replace the zero values with a small number manually, or is there a more elegant way? (Note that in this example the rows with zero also have a sample size of 1, and thus no variance, and will be dropped from the analysis anyway. I just want to know how this works for the future.)
Reproducible example:
dat <- structure(list(Species.ID = c("CAFERANA", "TR11", "TR118", "TR500",
"TR504", "TR9", "TR9_US1"), Y_num_mean.early = c(2, 147.375,
4.5, 0.5, 12.5, 93.4523809523809, 5), N.early = c(1L, 4L, 2L,
4L, 4L, 7L, 2L), sd.early = c(NA, 174.699444284558, 6.36396103067893,
1, 22.4127939653523, 137.506118190001, 7.07106781186548), se.early = c(NA,
87.3497221422789, 4.5, 0.5, 11.2063969826762, 51.9724274972283,
5), Y_num_mean.late = c(0, 3.625, 2.98482142857143, 0.8, 3, 47.2,
0), N.late = c(1L, 4L, 7L, 10L, 10L, 8L, 1L), sd.late = c(NA,
7.25, 5.10407804830748, 1.75119007154183, 8.03118920210451, 40.7351024477486,
NA), se.late = c(NA, 3.625, 1.9291601697265, 0.553774924194538,
2.53968501984006, 14.4020335865659, NA), Y_num_mean.wet = c(NA,
71.5, 0, 12, 27, 0, NA), N.wet = c(NA, 2L, 1L, 2L, 2L, 2L, NA
), sd.wet = c(NA, 17.6776695296637, NA, 9.89949493661167, 38.1837661840736,
0, NA), se.wet = c(NA, 12.5, NA, 7, 27, 0, NA)), row.names = c(NA,
7L), .Names = c("Species.ID", "Y_num_mean.early", "N.early",
"sd.early", "se.early", "Y_num_mean.late", "N.late", "sd.late",
"se.late", "Y_num_mean.wet", "N.wet", "sd.wet", "se.wet"), class = "data.frame", reshapeWide = structure(list(
v.names = c("Y_num_mean", "N", "sd", "se"), timevar = "early_or_late",
idvar = "Species.ID", times = c("early", "late", "wet"),
varying = structure(c("Y_num_mean.early", "N.early", "sd.early",
"se.early", "Y_num_mean.late", "N.late", "sd.late", "se.late",
"Y_num_mean.wet", "N.wet", "sd.wet", "se.wet"), .Dim = c(4L,
3L))), .Names = c("v.names", "timevar", "idvar", "times",
"varying")))
# Warning produced from this command
test <- escalc(measure="ROM", m1i=Y_num_mean.early, sd1i=sd.early, n1i=N.early, m2i=Y_num_mean.late, sd2i=sd.late, n2i=N.late, data=dat, add=1/2, to="only0")
The paragraph you are quoting applies to measures that one can calculate based on 2x2 tables (i.e., RR, OR, RD, AS, and PETO). The add and to arguments do not have any effect for measures such as SMD and ROM, so your call above behaves exactly as if those arguments had not been specified.
The only way you can get a mean of 0 for a ratio scale variable (which is what use of response ratios assumes) is if every value is equal to 0. Therefore, by definition, the variance must also be 0. This applies whether the sample size is 1 (in which case the variance is of course also 0) or whether you have a larger sample size.
In general, whenever at least one of the two means is 0, one cannot calculate the log response ratio. Of course, one could start adding some kind of constant to the means manually (and the same for the SDs), but this seems rather arbitrary. The adjustments we can make to counts in 2x2 tables are motivated by statistical theory (those adjustments are actually bias reductions, which also happen to make the calculation of certain measures possible when there is a 0 count).
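In practice, then, one option is simply to exclude such rows before calling escalc() (a sketch based on the example data above):
## the log response ratio is undefined when either mean is 0, so drop those rows
dat_ok <- subset(dat, Y_num_mean.early > 0 & Y_num_mean.late > 0)
test <- escalc(measure = "ROM", m1i = Y_num_mean.early, sd1i = sd.early, n1i = N.early,
               m2i = Y_num_mean.late, sd2i = sd.late, n2i = N.late, data = dat_ok)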