I am quite new to R and (I admit it!) not so good with statistics, so I am sorry if my problem is too trivial, but I would really appreciate some hints on the matter.
I have 9 points (plots) of soil humidity measurements for each of the 2 plantation systems we have (agroforestry and agriculture), measured weekly over 2 months. We also have the distance in meters between the closest tree (bigger than 5 cm DBH) and the exact measurement point in each of the plots (varying between 4.2 m and 12 m in agroforestry, and fixed at 50 m in agriculture). Therefore, I have a profile of humidity (y) over time (x) for each of the 18 plots (9 in agroforestry and 9 in agriculture); the profiles behave similarly but vary due to weather fluctuations. What I need to know is:
Are these variations in humidity between the measurement points over time dependent on (or influenced by) the distance to the trees? That is, do the trees hold more water in, or take more water from, the soil when they are closer to the measurement points (which are in the middle of a plantation)?
Are these curves (humidity x time) significantly different from each other?
I first thought about grouping the agroforestry plots in threes by tree distance (small, medium, and large distances from trees), taking all 9 agriculture plots as a single group, and using these as replications, since they behave more similarly. However, this confused me a bit.
So... I got as far as thinking about using a repeated-measures ANOVA from the ez package. In this case I had:
str(SanPedro)
'data.frame': 450 obs. of 6 variables:
 $ Parcel  : Factor w/ 2 levels "Forest","Agriculture": 1 1 1 1 1 1 1 1 1 1 ...
 $ Distance: Factor w/ 4 levels "A","B","C","D": 1 1 1 1 1 1 1 1 1 1 ...
 $ Plot    : num 1 1 1 1 1 1 1 1 1 1 ...
 $ Date    : Date, format: "0011-07-20" "0011-07-24" ...
 $ Humidity: num 0.217 0.205 0.199 0.2 0.192 0.181 0.184 0.18 0.179 0.178 ...
 $ Number  : num 1 2 3 4 5 6 7 8 9 10 ...
When I tried to run the ezANOVA as
ezANOVA(data=SanPedro, dv=Humidity, wid=Number, within=Parcel, between=Plot, type=1, return_aov=TRUE)
I got this:
Warning: Converting "Number" to factor for ANOVA.
Warning: "Plot" will be treated as numeric.
Error in ezANOVA_main(data = data, dv = dv, wid = wid, within = within, :
One or more cells is missing data. Try using ezDesign() to check your data.
If I check the data with ezDesign(SanPedro), I get:
ezDesign(SanPedro)
Error in as.list(c(x, y, row, col)) :
argument "x" is missing, with no default
In the end, I do not really understand the problem with the data, and I am not even sure that ezANOVA is the right analysis for my case... I would really appreciate any hints and ideas on the matter! Thanks a lot! =)
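A note on that last error: ezDesign() has to be told which design variables to cross; per the ez documentation it takes x and y (and optionally row and col) in addition to data, which is why calling it on the bare data frame complains that "x" is missing. A minimal sketch, assuming you want to see how often each subject (Number) appears in each Parcel:

library(ez)
# counts of observations per cell of Number x Parcel; empty cells reveal
# the "one or more cells is missing data" problem reported by ezANOVA()
ezDesign(data = SanPedro, x = Number, y = Parcel)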
I'm analyzing data from an experiment, replicated in time, where I measured plant emergence at the soil surface. I had 3 experimental runs, represented by the term trialnum, and would like to include trialnum as a random effect.
Here is a summary of variables involved:
'data.frame': 768 obs. of 9 variables:
$ trialnum : Factor w/ 2 levels "2","3": 1 1 1 1 1 1 1 1 1 1 ...
$ Flood : Factor w/ 4 levels "0","5","10","15": 2 2 2 2 2 2 1 1 1 1 ...
$ Burial : Factor w/ 4 levels "1.3","2.5","5",..: 3 3 3 3 3 3 4 4 4 4 ...
$ biotype : Factor w/ 6 levels "0","1","2","3",..: 1 2 3 4 5 6 1 2 3 4 ...
$ soil : int 0 0 0 0 0 0 0 0 0 0 ...
$ n : num 15 15 15 15 15 15 15 15 15 15 ...
Where trialnum is the experimental run, Flood, Burial, and biotype are input/independent variables, and soil is the response/dependent variable.
I previously created this model with all input variables:
glmfitALL <- glm(cbind(soil, n) ~ trialnum*Flood*Burial*biotype, family = binomial(logit), data = total)
From this model I found that by running
anova(glmfitALL, test = "Chisq")
trialnum is significant. There were 3 experimental runs; I'm only including 2 of those in my analysis. I have been advised to incorporate trialnum as a random effect so that I do not have to report the experimental runs separately.
To do this, I created the following model:
glmerfitALL <- glmer(cbind(soil, n) ~ Flood*Burial*biotype + (1|trialnum),
data = total,
family = binomial(logit),
control = glmerControl(optimizer = "bobyqa"))
From this I get the following error message:
maxfun < 10 * length(par)^2 is not recommended.
Unable to evaluate scaled gradient
Model failed to converge: degenerate Hessian with 9 negative eigenvalues
I have tried running this model in a variety of ways including:
glmerfitALL <- glmer(cbind(soil, n) ~ Flood*Burial*biotype*(1|trialnum),
data = total,
family = binomial(logit),
control = glmerControl(optimizer = "bobyqa"))
as well as incorporating REML = FALSE and using optimx in place of bobyqa, but every variant resulted in a similar error message.
Because this is an "eigenvalue" error, does that mean there is a problem with my source file/original data?
I also found previous threads regarding these lme4 error messages (sorry, I did not save the link), and saw some comments raising the issue of too few replicates of the random effect. Because I only have 2 levels, trialnum2 and trialnum3, can I even use trialnum as a random effect?
Regarding the eigenvalue error, the chief recommendation is centring and/or scaling the predictors.
Regarding the random-effect groups, around five levels is an approximate minimum; with only two, treating trialnum as a fixed effect is usually the safer choice.
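A minimal sketch of the centring/scaling suggestion, assuming Flood and Burial are dose-like and can sensibly be treated as numeric (biotype stays a factor); the S-suffixed names are just illustrative:

library(lme4)
# recover the numeric values from the factor levels, then centre and scale;
# scale() returns a one-column matrix, which glmer() accepts in a formula
total$FloodS  <- scale(as.numeric(as.character(total$Flood)))
total$BurialS <- scale(as.numeric(as.character(total$Burial)))
glmerfitS <- glmer(cbind(soil, n) ~ FloodS * BurialS * biotype + (1 | trialnum),
                   data = total,
                   family = binomial(logit),
                   control = glmerControl(optimizer = "bobyqa"))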
I have a rather complex question about how to write some R code; I have gone through all the options known to me without any results. I can do the first part in Excel in less than an hour, i.e., get a result for the following formula.
C = (1/n) ∑j nj(cj − aj)^2
where n is the total number of observations, nj is the number of observations in class interval j, cj is the mean confidence rating in interval j, aj is the proportion of correct responses in interval j, and the sum runs over the class intervals j.
The reason I want to do it in R is that (1) everything can be done at once and with the same program, and, more importantly, (2) I need to jackknife (or bootstrap) this formula to get the jackknife SEs (which I need for the inferential confidence intervals).
This is the data:
> str(data)
'data.frame': 372 obs. of 6 variables:
$ ID : int 120 432 664 163 283 659 78 150 158 188 ...
$ age : int 20 20 16 18 20 15 20 20 16 20 ...
$ sex : Factor w/ 2 levels "female","male": 1 1 1 2 2 2 1 1 1 2 ...
$ poconf123 : int 1 1 1 1 1 1 1 1 1 1 ...
$ PoConfPC : int 40 50 40 30 10 50 50 30 40 30 ...
$ idacc : Factor w/ 2 levels "accurate","inaccurate": 1 1 1 1 1 1 2 2 2 2 ...
To get the nj, cj and aj, I tried the following code
group_by(data, poconf123) %>% # this is to get results for each class interval
summarise(poconf = mean(PoConfPC)) %>% # this is to get cj
summarise(NumAcc = n(), prop = mean(idacc == 1)/n()) # this is to get nj and aj
Using the code above, I get the following error message:
Error in summarise_impl(.data, dots):
Evaluation error: object 'PoConfPC' not found.
The first two lines alone, or the first and the last lines alone, do not produce this error. However, the code
group_by(data, poconf123) %>%
summarise(NumAcc=n(), prop = mean(idacc==1)/n())
still seems not optimal, because instead of getting proportions, e.g., .53, I am getting the value 0 throughout the 'prop' column.
Of course, I could ask for a proportion table, e.g.,
prop.table(table(data$idacc))
to get the proportions for both accurate and inaccurate but this is not really what I need.
So, overall, I do not even get to the point where I can run the analysis and obtain a single C, let alone the jackknife SE. And this brings me to the actual question: how can I program this formula into a function, so that I can 'jackknife' it?
Below is an example of how to jackknife the point-biserial correlation with the outputs required for inferential confidence intervals.
library(bootstrap)
library(ltm) # provides biserial.cor()
xx = data$PoConfPC
yy = data$idacc
nn = length(xx)
# theta() receives the leave-one-out index vector x and subsets both variables
theta <- function(x,xx,yy) {biserial.cor(xx[x], yy[x], level = 1)}
results <- jackknife(1:nn, theta, xx, yy)
summary(results$jack.values); results$jack.se; results$jack.bias
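And here is a sketch of the same pattern applied to C itself, assuming idacc == "accurate" marks a correct response and PoConfPC is a percentage (divided by 100 so cj lives on the same 0-1 scale as aj):

library(dplyr)
library(bootstrap)
# C for an arbitrary subset of rows, so jackknife() can drop one row at a time
calib_C <- function(rows, dat) {
  cells <- dat[rows, ] %>%
    group_by(poconf123) %>%
    summarise(nj = n(),                        # observations per interval
              cj = mean(PoConfPC) / 100,       # mean confidence, 0-1 scale
              aj = mean(idacc == "accurate"))  # proportion correct
  sum(cells$nj * (cells$cj - cells$aj)^2) / length(rows)
}
results <- jackknife(1:nrow(data), calib_C, data)
results$jack.se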
I have a data frame with the following structure:
> t <- read.csv("combinedData.csv")[,1:7]
> str(t)
'data.frame': 699 obs. of 7 variables:
$ Awns : int 0 0 0 0 0 0 0 0 1 0 ...
$ Funnel : Factor w/ 213 levels "MEL001","MEL002",..: 1 1 2 2 2 3 4 4 4 4 ...
$ Plant : int 1 2 1 3 8 1 1 2 3 5 ...
$ Line : Factor w/ 8 levels "a","b","c","cA",..: 2 2 1 1 1 3 1 1 1 1 ...
$ X : int 1 2 3 4 7 8 9 10 11 12 ...
$ ID : Factor w/ 699 levels "MEL_001-1b","MEL_001-2b",..: 1 2 3 4 5 6 7 8 9 10 ...
$ BobWhite_c10082_241: int 2 2 2 2 2 2 0 2 2 0 ...
I want to construct a mixed-effects model. I know that the random effect I want to include (Funnel) is a factor in my data frame, but it does not work:
> lmer(t$Awns ~ (1|t$Funnel) + t$BobWhite_c10082_241)
Error: couldn't evaluate grouping factor t$Funnel within model frame: try adding grouping factor to data frame explicitly if possible
In fact this happens whatever I want to include as a random effect e.g. Plant:
> lmer(t$Awns ~ (1|t$Plant) + t$BobWhite_c10082_241)
Error: couldn't evaluate grouping factor t$Plant within model frame: try adding grouping factor to data frame explicitly if possible
Why is R giving me this error? The only other answer my google-fu turned up is that the random effect fed in wasn't a factor in the data frame. But as str() shows, t$Funnel certainly is.
It is actually not so easy to provide a convenient syntax for modeling functions and at the same time have a robust implementation. Most package authors assume that you use the data parameter, and even then scoping issues can occur. Strange things can therefore happen if you specify variables with the DF$col syntax, since package authors rarely spend much effort making this work correctly and rarely include unit tests for it.
It is therefore strongly recommended to use the data parameter if the modeling function offers a formula interface. Strange things can happen if you do not follow that practice (with other model functions such as lm, too).
In your example:
lmer(Awns ~ (1|Funnel) + BobWhite_c10082_241, data = t)
This not only works, but is also more convenient to write.
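An added benefit: once the variables live in data, downstream methods that rebuild the model frame also work. A small illustrative check (predict() on a fitted merMod accepts new data):

library(lme4)
m <- lmer(Awns ~ (1 | Funnel) + BobWhite_c10082_241, data = t)
# predictions for a few rows; allow.new.levels = TRUE also permits funnels
# that were not seen during fitting
predict(m, newdata = t[1:5, ], allow.new.levels = TRUE)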
I'm trying to measure the biological impacts of an industrial development using a Before-After-Gradient approach. I am using a linear mixed model approach in R, and am having trouble specifying an appropriate model, especially the random effects. I've spent a lot of time researching this, but so far haven't come up with a clear solution, at least not one that I understand. I am new to LMMs (and to R, for that matter), so I would welcome any advice.
The response variables (for example, changes in abundance of key species) will be measured as a function of distance from the edge of impact, using plots established at fixed distances along multiple transects ("gradients") radiating out from the edge of the disturbance. Ideally, each plot would be sampled at multiple times both before and after the impact; however, for simplicity I'm starting by assuming the simplest case, where each plot is sampled once before and once after the impact. Assume also that the individual gradients are far enough apart that they can be considered spatially independent.
First, some simulated data. The effect here is linear instead of curvilinear, but you get the idea.
> str(bag)
'data.frame': 30 obs. of 5 variables:
$ Plot : Factor w/ 15 levels "G1-D0","G1-D100",..: 1 2 4 5 3 6 7 9 10 8 ...
$ Gradient: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 2 2 2 2 2 ...
$ Distance: Factor w/ 5 levels "0","100","300",..: 1 2 3 4 5 1 2 3 4 5 ...
$ Period : Factor w/ 2 levels "After","Before": 2 2 2 2 2 2 2 2 2 2 ...
$ response: num 0.633 0.864 0.703 0.911 0.676 ...
> bag
Plot Gradient Distance Period response
1 G1-D0 1 0 Before 0.63258749
2 G1-D100 1 100 Before 0.86422356
3 G1-D300 1 300 Before 0.70262745
4 G1-D700 1 700 Before 0.91056851
5 G1-D1500 1 1500 Before 0.67637353
6 G2-D0 2 0 Before 0.75879579
7 G2-D100 2 100 Before 0.77981992
8 G2-D300 2 300 Before 0.87714158
9 G2-D700 2 700 Before 0.62888739
10 G2-D1500 2 1500 Before 0.83217617
11 G3-D0 3 0 Before 0.87931801
12 G3-D100 3 100 Before 0.81931761
13 G3-D300 3 300 Before 0.74489963
14 G3-D700 3 700 Before 0.68984485
15 G3-D1500 3 1500 Before 0.94942006
16 G1-D0 1 0 After 0.00010000
17 G1-D100 1 100 After 0.05338171
18 G1-D300 1 300 After 0.15846741
19 G1-D700 1 700 After 0.34909588
20 G1-D1500 1 1500 After 0.77138824
21 G2-D0 2 0 After 0.00010000
22 G2-D100 2 100 After 0.05801157
23 G2-D300 2 300 After 0.11422562
24 G2-D700 2 700 After 0.34208601
25 G2-D1500 2 1500 After 0.52606733
26 G3-D0 3 0 After 0.00010000
27 G3-D100 3 100 After 0.05418663
28 G3-D300 3 300 After 0.19295391
29 G3-D700 3 700 After 0.46279103
30 G3-D1500 3 1500 After 0.58556186
As far as I can tell, the fixed effects should be Period (Before,After) and Distance, treating distance as continuous (not a factor) so we can estimate the slope. The interaction between Period and Distance (equivalent to the difference in slopes, before vs. after) measures the impact. I'm still scratching my head over how to specify the random effects. I assume I should control for variation among gradients, as follows:
library(nlme)
result <- lme(response ~ Distance + Period + Distance:Period, random = ~ 1 | Gradient, data = bag)
However, I suspect I may be missing some source of variation. For example, I'm not sure the above model controls for the re-sampling of individual plots before and after. Any suggestions?
With one sample per gradient, as you have, there's no need to specify random effects or anything about the gradients; you can do this with a straight multiple regression. Once you have multiple measures within each gradient, you can use the model you've specified, which says that there is an expected main effect of gradient on the intercept while the effects (slopes) of Distance, Period, and their interaction are fixed.
You could specify additional random effects if you expect an appreciable amount of variability among gradients in your other predictors. I'm not sure how you do it in lme, or even whether you can, but in lmer an example might be:
lmer(response ~ Distance * Period + (1 + Distance | Gradient), data = bag)
That would allow the Distance slope to have a fixed-effect component and one that varies with gradient. You can look up further specification of random effects, but hopefully you see the general idea and can then decide how complex to make your model.
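One more wrinkle: in bag, Distance is stored as a factor, while the question wants it continuous so that Distance:Period is a difference in slopes. A minimal sketch of the conversion and the two fits (DistNum is just an illustrative name):

library(lme4)
# the factor levels are the metre values, so recover them via character
bag$DistNum <- as.numeric(as.character(bag$Distance))
# random intercept per gradient; DistNum:Period is the impact test
m1 <- lmer(response ~ DistNum * Period + (1 | Gradient), data = bag)
# adding a random Distance slope; may be overparameterized with only 3 gradients
m2 <- lmer(response ~ DistNum * Period + (1 + DistNum | Gradient), data = bag)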
The question I am posting here is closely linked to another question I posted two days ago about Gompertz aging analysis.
I am trying to construct a survival object (see ?Surv) in R. This will hopefully be used to perform a Gompertz analysis producing an output of two values (see the original question for further details).
I have survival data from an experiment on flies which examines rates of aging in various genotypes. The data is available to me in several layouts, so the choice of which to use is up to you, whichever suits the answer best.
One dataframe (wide.df) looks like this, where each genotype (Exp, of which there are ~640) has a row, and the days run in sequence horizontally from day 4 to day 98, with counts of new deaths every two days.
Exp Day4 Day6 Day8 Day10 Day12 Day14 ...
A 0 0 0 2 3 1 ...
I make the example using this:
wide.df2 <- data.frame("A",0,0,0,2,3,1,3,4,5,3,4,7,8,2,10,1,2)
colnames(wide.df2) <- c("Exp", paste0("Day", seq(4, 36, by = 2)))
Another version is like this, where each day has a row for each 'Exp' and the number of deaths on that day are recorded.
Exp Deaths Day
A 0 4
A 0 6
A 0 8
A 2 10
A 3 12
.. .. ..
To make this example:
df2 <- data.frame(Exp = factor(rep("A", 17)),
                  Deaths = c(0,0,0,2,3,1,3,4,5,3,4,7,8,2,10,1,2),
                  Day = seq(4, 36, by = 2))
Each genotype has approximately 50 flies in it. What I need help with now is how to go from one of the above dataframes to a working survival object. What does this object look like? And how do I get from the above to the survival object smoothly?
After noting that the total of Deaths was 55 and you said the number of flies was "around 50", I decided the likely assumption was that this is a completely observed process (no censoring). So you need to replicate the duplicated deaths so there is one row per death, and assign an event marker of 1. The "long" format is clearly the preferred format. You can then create a Surv object from 'Day' and 'event':
?Surv
# expand to one row per death by repeating each row 'Deaths' times
df3 <- df2[rep(rownames(df2), df2$Deaths), ]
str(df3)
#---------------------
'data.frame': 55 obs. of 3 variables:
$ Exp : Factor w/ 1 level "A": 1 1 1 1 1 1 1 1 1 1 ...
$ Deaths: num 2 2 3 3 3 1 3 3 3 4 ...
$ Day : num 10 10 12 12 12 14 16 16 16 18 ...
#----------------------
df3$event <- 1 # 1 = death observed; no censoring
str(with(df3, Surv(Day, event)))
#------------------
Surv [1:55, 1:2] 10 10 12 12 12 14 16 16 16 18 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:2] "time" "status"
- attr(*, "type")= chr "right"
Note: if this were being done in the coxph function, the expansion to individual lines of data might not have been needed, since that function allows specification of case weights. (I'm guessing that the other regression functions in the survival package would not have needed it either.) In the past, Terry Therneau has expressed puzzlement that people create Surv objects outside the formula interface of coxph. The intended use of this Surv object was not described in sufficient detail to know whether a weighted analysis without expansion would be possible.
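A sketch of that weighted alternative on the un-expanded long data (with the single-genotype toy data the model is just a null fit; real data would put the genotype on the right-hand side of the formula):

library(survival)
df2$event <- 1 # every counted death is an observed event
fit <- coxph(Surv(Day, event) ~ 1,
             data = df2,
             weights = Deaths,     # each row stands for 'Deaths' flies
             subset = Deaths > 0)  # drop days with no deaths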