Repeated Measures ANOVA in R? - r

I am looking at average home range size on two sites (one that has undergone habitat restoration and one that serves as an experimental control) during three phases of the restoration process (before, during, and two years after). I want to see whether mean home range size differs across sites and periods. Since I have two categorical variables (site and period), I assume this would be done using a repeated measures ANOVA? I would like to see what code would be used, since I have never done an ANOVA in R before.
rm (list = ls())
hrdata=read.csv(xxx)
hrdata

I think you could do this with a linear model, but see (https://stats.stackexchange.com/questions/20002/regression-vs-anova-discrepancy-aov-vs-lm-in-r) for a discussion of anova vs regression.
The code would look something like this:
lm1 <- lm(HRS ~ Site * Period, data=hrdata)
The left-hand side of this code simply stores the fitted linear model (lm) as an R object, which we have named lm1.
Then you can do:
summary(lm1)
This would look at the effects of Site (habitat restoration vs control), Period (before, during, and after), and the interaction between the two.
There are lots of posts about interpreting these summary results. I have posted some below. This first one may be useful if you aren't sure how to interpret the interaction terms:
https://stats.stackexchange.com/questions/56784/how-to-interpret-the-interaction-term-in-lm-formula-in-r
https://stats.stackexchange.com/questions/59250/how-to-interpret-the-output-of-the-summary-method-for-an-lm-object-in-r
https://stats.stackexchange.com/questions/115304/interpreting-output-from-anova-when-using-lm-as-input
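As a minimal sketch of the model above, assuming a data frame with columns HRS, Site, and Period (the values below are simulated placeholders, not real home range data), anova() on the fitted lm gives an ANOVA-style table for the two main effects and their interaction. Note that this treats every row as independent; it does not by itself account for repeated measurements on the same animals.

```r
# Simulated stand-in for hrdata: the column names follow the answer above,
# the values are made up purely so that the example runs.
set.seed(1)
hrdata <- expand.grid(
  Site   = c("control", "restored"),
  Period = c("before", "during", "after"),
  rep    = 1:10
)
hrdata$Site   <- factor(hrdata$Site)
hrdata$Period <- factor(hrdata$Period, levels = c("before", "during", "after"))
hrdata$HRS    <- rnorm(nrow(hrdata), mean = 50, sd = 10)

# Main effects of Site and Period plus their interaction
lm1 <- lm(HRS ~ Site * Period, data = hrdata)

summary(lm1)              # coefficient-level estimates and tests
aov_table <- anova(lm1)   # sequential ANOVA table: Site, Period, Site:Period
print(aov_table)
```

The interaction row (Site:Period) is the one that addresses whether the change across periods differs between the restored and control sites.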

Related

Singularity in Linear Mixed Effects Models

Dataset Description: I use a dataset with neuropsychological (np) tests from several subjects. Every subject has more than one test in his/her follow-up, i.e. one test per year. I study the cognitive decline in these subjects. The information I have is: Individual number (identity number), Education (years), Gender (M/F as factor), Age (years), Time from Baseline (= years after the first np test).
AIM: My aim is to measure the rate of change in their np tests, i.e. the cognitive decline per year for each of them. To do that I use Linear Mixed Effects Models (LMEM), taking into account the above parameters, and I compute the slope for each subject.
Question: When I run the possible models (combining different parameters every time), I also check their singularity, and the result in almost all cases is TRUE. So my models are singular! If I wanted to use these models to make predictions this would not be good, as it means that the model overfits the data. But since I just want to find the slope for each individual, I think that this is not a problem, or even that it is an advantage, as in that case singularity offers a more precise calculation of the subjects' slopes. Do you think that this reasoning is correct?
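For reference, the workflow described above (fit a random-slope model, check singularity, extract one slope per subject) can be sketched with lme4. This uses the sleepstudy data shipped with lme4 as a stand-in, since the np test data are not available; Reaction, Days, and Subject are that data set's columns, not the asker's.

```r
library(lme4)

# Random intercept and random slope for time within each subject
fit <- lmer(Reaction ~ Days + (1 + Days | Subject), data = sleepstudy)

# TRUE when a variance-covariance parameter is estimated on the boundary
# (for example a variance of zero or a correlation of +/-1)
singular <- isSingular(fit)

# Per-subject slopes: the fixed slope plus each subject's random deviation
subject_slopes <- coef(fit)$Subject[, "Days"]
head(subject_slopes)

# Shows which variance component, if any, sits on the boundary
VarCorr(fit)
```

If the slope variance is estimated as zero, coef() returns the same slope for every subject, so it is worth checking VarCorr(fit) before interpreting individual slopes from a singular fit.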

Create a new dataframe to do piecewise linear regression on percentages after doing serial crosstabs in R

I am working with R. I need to identify the predictors of higher Active trial start percentage over time (StartDateMonthsYrs). I will do linear regression with Percent.Active as the dependent variable.
My original dataframe is attached, and my obtained Active trial start percentage over time (named Percent.Active) is presented here.
So, I need to assess whether federal sponsored trials, industry sponsored trials or other sponsored trials were associated with a higher active trial start percentage over time. I have many other variables that I need to assess, but this is a sample of my data.
I am thinking of doing many crosstabs for each variable (e.g. Federal & Active, then Industry & Active, etc.) in each month (maybe with the help of lapply), then accumulating the obtained percentages in the second sheet and running the analysis based on that.
My code for linear regression is as follows:
q.lm0 <- lm(Percent.Active ~ Time.point + xyz, data = data.percentage)
summary(q.lm0)
I'm a little bit confused. You write 'associated'. If you really want to look for association, then yes, a crosstab might be possible and sufficient, since association is not the same as causation (which can only be argued from correlation if there is a theory behind it). If you are looking for correlation and insights over time, a plain regression with the lm function is not very useful.
If you want a regression-type analysis, there are packages in R such as plm that can deal with panel data, which you clearly have (time points, trial labels of interest, and repeated time points for these labels). See this post for information about the package: https://stackoverflow.com/questions/2804001/panel-data-with-binary-dependent-variable-in-r
I'm pointing you to this because your Percent.Active variable appears to be a binary 0/1 outcome, and I'm not sure whether this is on purpose. Even if your outcome is not binary, the plm package might help, and you will find other packages mentioned in that post.
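The serial crosstab step described in the question can be sketched in base R without looping: aggregate a 0/1 activity indicator by month and sponsor type to get one percentage per cell, then regress on that. The data frame trials and its columns month, sponsor, and active are hypothetical names with simulated values, standing in for the attached data.

```r
# Simulated trial-level data; names and values are illustrative only.
set.seed(1)
trials <- data.frame(
  month   = rep(1:12, each = 30),
  sponsor = sample(c("Federal", "Industry", "Other"), 360, replace = TRUE),
  active  = rbinom(360, size = 1, prob = 0.4)
)

# One row per month x sponsor, holding the percentage of active trials
pct <- aggregate(active ~ month + sponsor, data = trials,
                 FUN = function(x) 100 * mean(x))
names(pct)[names(pct) == "active"] <- "Percent.Active"

# Does the time trend in Percent.Active differ by sponsor type?
fit <- lm(Percent.Active ~ month * sponsor, data = pct)
summary(fit)
```

This keeps the regression on percentages that the question proposes; it does not address the panel-data concerns raised in the answer.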

lmer or binomial GLMM [closed]

I am running a mixed model in R. However I am having some difficulty understanding the type of model I should be running for the data that I have.
Let's call the dependent variable the number of early button presses in a computerised experiment. An experiment is made up of multiple trials. In each trial a participant has to press a button to react to a target appearing on a screen. However, they may press the button too early, and this is what is being measured as the outcome variable. So for example, participant A may have 3 early button presses in total across the trials of an experiment, whereas participant B may have 15.
In a straightforward linear regression model using the lm command in R, I would think this outcome is a continuous numerical variable. After all, it's a number that participants score in the experiment. However, I am not trying to run a linear regression; I am trying to run a mixed model with random effects. My understanding of a mixed model in R is that the data should be structured to show every participant by every trial. When the data is structured like this at trial level, suddenly I have a lot of 1s and 0s in my outcome column, because at trial level a participant may accidentally press the button too early and score a 1, or not and score a 0.
Does this sound like something that needs to be considered as categorical? If so, would it then be modelled with the glmer function with family set to binomial?
Thanks
As stated by Martin, this question seems to be more of a Cross Validated question. But I'll throw in my 2 cents here.
The question often becomes what you're interested in with the experiment, and whether you have cause to believe that there is a random effect in your model. In your example you have 2 possible effects that could be random: the individuals and the trials. In classical random-effect models the random effects are often chosen based on a series of rules of thumb such as
If the parameter can be thought of as random. This often refers to the levels changing within a factor. In this situation both individuals and the trials are likely to change between experiments.
If you're interested in the systematic effect (e.g. how much did A affect B) then the effect is not random and should be considered for the fixed effects. In your case, it is really only relevant if there are enough trials to see a systematic effect across individuals, but one could then question how relevant this effect would be for generalized results.
Several other rules of thumb exist out there, but this at least gives us a place to start. The next question becomes which effect we're actually interested in. In your case it is not quite clear, but it sounds like you're interested in one of the following.
How many early button presses can we expect for any given trial?
How many early button presses can we expect for any given individual?
How big is the chance that an early button press happens during any given trial?
For the first 2, you can benefit from aggregating over either individual or trial and using a linear mixed effect model with the counterpart as random effect. Although I would argue that a Poisson generalized linear model is likely a better fit, as you are modelling counts that can only be non-negative. E.g. in a rather general sense use:
library(lme4)
# df is assumed to contain the raw (trial-level) data
# 1) aggregate() needs an explicit FUN; sum turns trial-level 0/1 into counts
df_agg <- aggregate(. ~ individual, data = df, FUN = sum)
lmer(early_clicks ~ . - individual + (1 | individual), data = df_agg) # or better: glmer(early_clicks ~ . - individual + (1 | individual), family = poisson, data = df_agg)
# 2)
df_agg <- aggregate(. ~ trial, data = df, FUN = sum)
lmer(early_clicks ~ . - trial + (1 | trial), data = df_agg) # or better: glmer(early_clicks ~ . - trial + (1 | trial), family = poisson, data = df_agg)
#3)
glmer(early_clicks ~ . + (1 | trial) + (1 | individual), family = binomial, data = df)
Note that we could use 3) to get answers for 1) and 2) by using 3) to predict probabilities and use these to find the expected early_clicks. However one can show theoretically that the estimation methods used in linear mixed models are exact, while this is not possible for generalized linear models. As such the results may differ slightly (or quite substantially) between all models. Especially in 3) the number of random effects may be quite substantial compared to the number of observations, and in practice may be impossible to estimate.
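A minimal sketch of option 3) and of using its predicted probabilities to recover expected counts per individual, as described above. The data are simulated, the model has no fixed-effect predictors, and the names df, individual, trial, early_click, and p_hat are illustrative assumptions rather than the asker's actual variables.

```r
library(lme4)

# Simulated trial-level data: every individual sees every trial once.
set.seed(42)
n_ind   <- 30
n_trial <- 20
df <- expand.grid(individual = factor(seq_len(n_ind)),
                  trial      = factor(seq_len(n_trial)))
ind_eff   <- rnorm(n_ind,   sd = 0.8)   # individual-level tendency
trial_eff <- rnorm(n_trial, sd = 0.3)   # trial-level difficulty
eta <- -1.5 + ind_eff[as.integer(df$individual)] +
  trial_eff[as.integer(df$trial)]
df$early_click <- rbinom(nrow(df), size = 1, prob = plogis(eta))

# Binomial GLMM with crossed random intercepts for trial and individual
fit3 <- glmer(early_click ~ 1 + (1 | trial) + (1 | individual),
              family = binomial, data = df)

# Predicted probability of an early press for each individual x trial,
# summed over trials to give the expected count per individual
df$p_hat <- predict(fit3, type = "response")
expected_clicks <- aggregate(p_hat ~ individual, data = df, FUN = sum)
head(expected_clicks)
```

With real data, fixed-effect predictors would replace the intercept-only formula, and a singular-fit or convergence warning is a sign that the random-effect structure is too rich for the number of observations.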
Disclaimer
I have only very briefly gone over some principles, and while this may serve as a very brief introduction, it is by no means exhaustive. In the last 15 - 20 years the theory and practice of mixed effect models have been extended substantially. If you'd like more information about mixed effect models, I'd suggest starting at the GLMM FAQ site by Ben Bolker (and others) and the references listed there. For estimation and implementations I suggest reading the vignettes of the lme4, glmmTMB and possibly merTools packages, glmmTMB being a more recent and interesting project.

How can I include repeated measures to my lmer correctly

In my study I sampled the same sites in different regions over many years. Each site has different properties in each year, which is important for my research question. I want to know whether the properties of a site affect biodiversity on that site, and I am interested in the interaction between the properties and the regions.
Overview:
Biodiversity = response
Site property = fixed factor, changes every year
Region = fixed factor , same regions every year
Site = random effect, is repeatedly sampled in the different sampling years
Year = random effect, is the factor in which "site" is repeated
At the moment my model looks like this:
mod1 <- lmer(biodiversity~region*siteProperty+(1|Year)+(1|site))
I'm not sure if this accounts for the repeated measures.
Alternatively I was thinking about this one, as it includes also the nestedness of the sites in the years, but maybe that is not necessary:
mod2 <- lmer(biodiversity~region*siteProperty+(1|Year)+(1|Year:site))
The problem with this approach is that it only works if my site properties are not zero. But I have zeros in different properties, and I need to analyse their effects as well.
If you need more information, just ask.
Thanks for your help!
Your first example,
mod1 <- lmer(biodiversity~region*siteProperty+(1|Year)+(1|site))
should be fine (although I'd encourage you to use the data argument explicitly ...). If you have samples for "many years" for each site, you might want to consider
including a trend model (i.e. include Year, as a numeric variable, in the fixed effects part of the model as well, either as a simple linear term or as part of an additive model, e.g. using splines::ns)
checking for/allowing for autocorrelation (although this is tricky in lme4; you can use lme, but then crossed random effects of Year and site become harder).
If you have one sample per site/year combination, you don't want (1|Year:site), because that will be the same as the residual variability term.
Your statement "only works if my site properties are not zero" doesn't make sense to me: in general, having zeros in predictor variables shouldn't cause any problems ... ?
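A minimal sketch of the first suggestion (explicit data argument plus a linear trend in year), using simulated data. The data frame dat and the columns year_num and year_c are assumptions introduced for illustration; year is centred so that the numeric trend is on a sensible scale.

```r
library(lme4)

# Simulated design: 20 sites, each sampled once per year for 10 years.
set.seed(1)
dat <- expand.grid(site = factor(1:20), year_num = 2010:2019)
dat$region <- factor(ifelse(as.integer(dat$site) <= 10, "north", "south"))
dat$Year   <- factor(dat$year_num)     # grouping factor for (1 | Year)
dat$year_c <- dat$year_num - 2010      # centred numeric year for the trend
dat$siteProperty <- rnorm(nrow(dat))   # includes values at and around zero
site_eff <- rnorm(20)
year_eff <- rnorm(10, sd = 0.5)
dat$biodiversity <- 10 + 0.5 * dat$siteProperty + 0.2 * dat$year_c +
  site_eff[as.integer(dat$site)] + year_eff[as.integer(dat$Year)] +
  rnorm(nrow(dat))

# mod1 from the question, with data = and a linear year trend added
mod_trend <- lmer(
  biodiversity ~ region * siteProperty + year_c + (1 | Year) + (1 | site),
  data = dat
)
summary(mod_trend)
```

Replacing year_c with splines::ns(year_c, df = 3) would give the additive-model variant mentioned above.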

R - Linear Regression - Control for a variable

I have a computer science background & I am trying to teach myself data science by solving the problems available on the internet
I have a smallish data set which has 3 variables - race, gender and annual income. There are about 10,000 sample observations. I am trying to predict income from race & gender.
I have divided the data into 2 parts, one for each gender, and now I am trying to create 2 regression models. Is this possible in R? Can someone provide example syntax?
You don't specify how your data are stored or how the variable race is recorded (is it a factor?)
[If you're just fitting income against race for males, say, and you had the male income and race in income.m and race.m and if the second was a factor in R, then lm(income.m~race.m) will fit the line for males (use summary on the resulting object to get information about it). You could do something similar for females. But most people won't fit the models this way.]
If you're prepared to assume that the variation about the lines is the same for both genders, you can fit both lines with one model.
This has several advantages over analyzing the lines separately, though that can also be done.
If gender is either a factor or a numeric variable recorded as (0/1), and race is a factor and you have the data in a data frame (called, for example, incdata), then you'd fit both lines at once with:
lm(income~race*gender, data=incdata)
which is R shorthand for
lm(income~race+gender+race:gender, data=incdata)
where race:gender is an interaction term.
If you further assume that the effect of race is the same for both sexes, then the smaller model:
lm(income~race+gender, data=incdata)
would be used instead. This would often be the model people would fit if asked to 'control for gender', though many would consider the interaction model I mentioned before instead.
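To make the comparison between the two models concrete, here is a small sketch on simulated data (the data frame incdata follows the naming above, but its values and the race levels "A", "B", "C" are invented). anova() on the two nested fits tests whether the interaction terms improve on the additive model.

```r
# Simulated stand-in for incdata; values are illustrative only.
set.seed(123)
n <- 500
incdata <- data.frame(
  race   = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  gender = factor(sample(c("F", "M"), n, replace = TRUE))
)
incdata$income <- 40000 + 3000 * (incdata$gender == "M") +
  2000 * (incdata$race == "B") + rnorm(n, sd = 8000)

fit_add  <- lm(income ~ race + gender, data = incdata)   # 'control for gender'
fit_int  <- lm(income ~ race * gender, data = incdata)   # separate lines per gender
fit_long <- lm(income ~ race + gender + race:gender, data = incdata)

# The shorthand and the expanded formula give identical coefficients
same_fit <- isTRUE(all.equal(coef(fit_int), coef(fit_long)))

# F test of the interaction terms (additive model nested in interaction model)
anova(fit_add, fit_int)
```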
I'd strongly advise working on more simple regression problems first, with a textbook or set of notes suitable for guiding you through the ideas.
If you haven't already fitted a regression in R, I'd start with a smaller data set that only has a single predictor just to get used to the basic mechanics.
R comes with many data sets already built in. See, for example, library(help=datasets) which has about 80 data sets; some of the packages that come with R have more (MASS has over 80, for example). Many R packages on CRAN are packed with data sets, many suitable for regression.
For example, the cars data set (see ?cars in R) records the stopping distance of cars, given their speed. You don't need to read the data in, it's already there.
A simple linear regression (not necessarily the best model given some understanding of physics, but just about adequate for the data) would be:
lm(dist~speed, cars)
Again, you use summary to examine it. e.g. (I suggest you type these one at a time):
carsfit <- lm(dist~speed, cars)
summary(carsfit)
plot(dist~speed, cars)
abline(carsfit, col=2)
The examples in the help on the cars data set (?cars) give several other models and plots. You might try those one at a time also.
The car package (CAR is short for "Companion to Applied Regression") has many small data sets specifically for regression.
It is very simple.
fit1 <- lm(income~gender+race,data=Dataframe1)
summary(fit1)
I would not recommend using two dataframes unless you are using more advanced statistical methods that require them. Just use your gender variable.
Also, check this site out: http://www.statmethods.net/stats/regression.html
You could indeed do so Abhi but I believe your question is very broad.
(1) you could predict income from race and gender. This can be done in various ways but the most common would perhaps be "regression analysis". I suggest you do some searches on the internet on that topic. Answering what kind of regression and how to perform it is a matter of situation. You would probably find out yourself after reading about regression.
(2) R can do that. But I suggest you do some reading about regression before you get into R.
(3) If I were to analyze if race and gender can predict income I would simply do a linear regression where income would be the dependent variable and race and sex would be independent (predictors). This can be done by the "lm" function in R.
Or did I misunderstand something here?
Regards
You need to do some reading on linear/multiple regression techniques. I am not sure why you divide the data into 2 groups based on gender. Randomly split the data into train and test sets, so that you can fit the model on the training set and validate it on the test set.
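A random train/test split of this kind can be sketched in base R. The example uses the built-in cars data mentioned earlier rather than the income data, and the 70/30 split and the RMSE metric are arbitrary choices made for illustration.

```r
set.seed(2024)

# Randomly assign 70% of the rows to the training set
n         <- nrow(cars)
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train     <- cars[train_idx, ]
test      <- cars[-train_idx, ]

# Fit on the training rows only
fit <- lm(dist ~ speed, data = train)

# Evaluate on the held-out rows: root mean squared prediction error
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$dist - pred)^2))
rmse
```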

Resources