I am trying to follow the tutorial by Datanovia for Two-way repeated measures ANOVA.
A quick overview of my dataset:
I have measured the number of different bacterial species in 12 sampling units over time. I have 16 time points and 2 groups. I have organised my data as a tibble called "richness":
# A tibble: 190 x 4
id selection.group Day value
<fct> <fct> <fct> <dbl>
1 KRH1 KR 2 111.
2 KRH2 KR 2 141.
3 KRH3 KR 2 110.
4 KRH1 KR 4 126
5 KRH2 KR 4 144
6 KRH3 KR 4 135.
7 KRH1 KR 6 115.
8 KRH2 KR 6 113.
9 KRH3 KR 6 107.
10 KRH1 KR 8 119.
The id refers to each sampling unit, and selection.group is a factor with two levels (KR and RK).
richness <- tibble(
id = factor(c("KRH1", "KRH3", "KRH2", "RKH2", "RKH1", "RKH3")),
selection.group = factor(c("KR", "KR", "KR", "RK", "RK", "RK")),
Day = factor(c(2,2,4,2,4,4)),
value = c(111, 110, 144, 92, 85, 69)) # subset of original data
My tibble appears to be in the same format as the one in the tutorial:
> str(selfesteem2)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 72 obs. of 4 variables:
$ id : Factor w/ 12 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
$ treatment: Factor w/ 2 levels "ctr","Diet": 1 1 1 1 1 1 1 1 1 1 ...
$ time : Factor w/ 3 levels "t1","t2","t3": 1 1 1 1 1 1 1 1 1 1 ...
$ score : num 83 97 93 92 77 72 92 92 95 92 ..
Before I can run the repeated measures ANOVA I must check for normality in my data. I copied the framework proposed in the tutorial.
#my code
richness %>%
group_by(selection.group, Day) %>%
shapiro_test(value)
#tutorial code
selfesteem2 %>%
group_by(treatment, time) %>%
shapiro_test(score)
But I get the error message "Error: Column `variable` is unknown" when I try to run my code. Does anyone know why this happens?
I tried to continue without assurance that my data is normally distributed and ran the ANOVA:
res.aov <- rstatix::anova_test(
data = richness, dv = value, wid = id,
within = c(selection.group, Day)
)
But I get this error message:
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
0 (non-NA) cases
I have checked for NA values with any(is.na(richness)) which returns FALSE. I have also checked table(richness$selection.group, richness$Day) to be sure my setup is correct
2 4 6 8 12 16 20 24 28 29 30 32 36 40 44 50
KR 6 6 6 6 6 6 6 6 6 6 6 5 6 6 6 6
RK 6 6 6 6 6 5 6 6 6 6 6 6 6 6 6 6
And the setup appears correct. I would be very grateful for tips on solving this.
Best regards Madeleine
Below is a subset of my dataset in a reproducible format:
library(tidyverse)
library(rstatix)
library(tibble)
richness_subset = data.frame(
id = c("KRH1", "KRH3", "KRH2", "RKH2", "RKH1", "RKH3"),
selection.group = c("KR", "KR", "KR", "RK", "RK", "RK"),
Day = c(2,2,4,2,4,4),
value = c(111, 110, 144, 92, 85, 69))
richness_subset$Day = factor(richness_subset$Day)
richness_subset$selection.group = factor(richness_subset$selection.group)
richness_subset$id = factor(richness_subset$id)
richness_subset = tibble::as_tibble(richness_subset)
richness_subset %>%
group_by(selection.group, Day) %>%
shapiro_test(value)
# gives Error: Column `variable` is unknown
res.aov <- rstatix::anova_test(
data = richness, dv = value, wid = id,
within = c(selection.group, Day)
)
# gives Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
# 0 (non-NA) cases
I created a dataset with a design similar to yours:
set.seed(111)
richness = data.frame(id=rep(c("KRH1","KRH2","KRH3"),6),
selection.group=rep(c("KR","RK"),each=9),
Day=rep(c(2,4,6),each=3,times=2),value=rpois(18,100))
richness$Day = factor(richness$Day)
richness$id = factor(richness$id)
First, shapiro_test: there is a bug in the function, and the column you want to test cannot be named "value":
# gives error Error: Column `variable` is unknown
richness %>% shapiro_test(value)
#works
richness %>% mutate(X = value) %>% shapiro_test(X)
# A tibble: 1 x 3
variable statistic p
<chr> <dbl> <dbl>
1 X 0.950 0.422
1 X 0.963 0.843
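If renaming the column is inconvenient, the same kind of check can be sketched with base R's shapiro.test(), which takes a plain numeric vector and so does not care what the column is called. This is only a sketch on simulated data with the same layout as above; it tests within each selection group rather than within each group and day, because cells of three values are too small to say much about normality.

```r
# Simulated data with the same layout as the example above
set.seed(111)
richness <- data.frame(
  id = rep(c("KRH1", "KRH2", "KRH3"), 6),
  selection.group = rep(c("KR", "RK"), each = 9),
  Day = factor(rep(c(2, 4, 6), each = 3, times = 2)),
  value = rpois(18, 100)
)

# Split the response by group and run the Shapiro-Wilk test on each piece
by_group <- split(richness$value, richness$selection.group)
shapiro_p <- sapply(by_group, function(v) shapiro.test(v)$p.value)
shapiro_p
```

The result is a named vector with one p-value per selection group.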
Second, for the anova, this works for me.
rstatix::anova_test(
data = richness, dv = value, wid = id,
within = c(selection.group, Day)
)
In my example every term can be estimated. What I suspect is that one of your terms is a linear combination of the others. Using a modified version of my example,
set.seed(111)
richness =
data.frame(id=rep(c("KRH1","KRH2","KRH3","KRH4","KRH5","KRH6"),3),
selection.group=rep(c("KR","RK"),each=9),
Day=rep(c(2,4,6),each=3,times=2),value=rpois(18,100))
richness$Day = factor(richness$Day)
richness$id = factor(richness$id)
rstatix::anova_test(
data = richness, dv = value, wid = id,
within = c(selection.group, Day)
)
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
0 (non-NA) cases
Gives the exact same error. This can be checked using:
lm(value~id+Day:selection.group,data=richness)
Call:
lm(formula = value ~ id + Day:selection.group, data = richness)
Coefficients:
(Intercept) id1 id2
101.667 -3.000 -6.000
id3 id4 id5
-6.000 1.889 11.556
Day2:selection.groupKR Day4:selection.groupKR Day6:selection.groupKR
1.667 -12.000 9.333
Day2:selection.groupRK Day4:selection.groupRK Day6:selection.groupRK
-1.667 NA NA
The Day4:selection.groupRK and Day6:selection.groupRK coefficients are not estimable because those columns are linear combinations of terms that come earlier in the model.
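A quick way to see where such a dependence can come from is to cross-tabulate id against the within-subject cells. The sketch below rebuilds the second simulated dataset with base R only and counts, for each id, how many observations fall in each selection.group and Day combination. The object names cells and complete_ids are made up for this illustration.

```r
# Same design as the failing example above
set.seed(111)
richness <- data.frame(
  id = rep(c("KRH1", "KRH2", "KRH3", "KRH4", "KRH5", "KRH6"), 3),
  selection.group = rep(c("KR", "RK"), each = 9),
  Day = factor(rep(c(2, 4, 6), each = 3, times = 2)),
  value = rpois(18, 100)
)

# Rows: ids; columns: every selection.group x Day cell
cells <- with(richness, table(id, interaction(selection.group, Day)))
cells

# ids that were observed in every cell (a fully within-subject design needs all of them)
complete_ids <- rownames(cells)[apply(cells > 0, 1, all)]
complete_ids
```

Here every id appears in only three of the six cells, so no id has a complete set of repeated measures, which is consistent with some interaction terms not being estimable.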
The solution proposed above for running shapiro_test worked.
And by running lm(value~id+Day:selection.group,data=richness) I figured out that I do have some linear combination. However, I don't understand why. I know I have data points for each group (see graph). Where does this linear combination come from?
Repeated measure ANOVA appears so appropriate for me as I am following sampling units over time.
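One possible source of the linear dependence, offered as a guess from the data shown rather than a confirmed diagnosis: each id seems to belong to exactly one selection group (KRH1 to KRH3 are always KR, RKH1 to RKH3 are always RK). If that holds for the full dataset, selection.group varies between sampling units, not within them, and only Day is a repeated measure. The sketch below checks this on the posted subset with base R; groups_per_id is a name invented for the illustration.

```r
# Subset of the data as posted in the question
richness_subset <- data.frame(
  id = factor(c("KRH1", "KRH3", "KRH2", "RKH2", "RKH1", "RKH3")),
  selection.group = factor(c("KR", "KR", "KR", "RK", "RK", "RK")),
  Day = factor(c(2, 2, 4, 2, 4, 4)),
  value = c(111, 110, 144, 92, 85, 69)
)

# Number of distinct selection groups each id appears in
groups_per_id <- with(
  richness_subset,
  tapply(selection.group, id, function(g) length(unique(g)))
)
groups_per_id

# TRUE means every id sits in a single group, i.e. the factor is between-subjects
all(groups_per_id == 1)

# Under that assumption the call would treat the group as a between factor
# (not run here, requires the rstatix package and the full data):
# rstatix::anova_test(data = richness, dv = value, wid = id,
#                     between = selection.group, within = Day)
```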
I had the same issue and couldn't find a solution. In the end, the following worked:
Install the "ez" package, then call ezANOVA() following this template:
newModel <- ezANOVA(data = dataFrame, dv = .(outcome variable), wid = .(variable that identifies participants), within = .(repeated measures predictors), between = .(between-group predictors), detailed = FALSE, type = 2)
Example:
bushModel <- ezANOVA(data = longBush, dv = .(Retch), wid = .(Participant), within = .(Animal), detailed = TRUE, type = 3)
So I have a data frame and I want to create a new variable randomly using other columns. My data contains these key variables:
iq   Age   Educ_y
5    23    15
4    54    17
2    43    6
3    13    7
5    14    8
1    51    16
I want to generate a new variable (years of experience) randomly using these criteria:
If Age >= 15 & iq <= 2, "Exp_y" takes a random number between (Age-15)/2 and Age-15.
If Age >= 15 & (iq == 3 | iq == 4), "Exp_y" takes a random number between (Age-Educ_y-6)/2 and (Age-Educ_y-6).
And 0 otherwise.
I tried using this code :
Df <- Df %>%
rowwise() %>%
mutate(Exep_y = case_when(
Age > 14 & iq <= 2 ~ sample(seq((Age-15)/2, Age-15, 1), 1),
Age > 14 & between(iq, 3, 4) ~ sample(seq((Age-Educ_y-6)/2, Age-Educ_y-6, 1), 1),
TRUE ~ 0
))
But I end up with this Error message:
Error in `mutate()`:
! Problem while computing `Exep_y = case_when(...)`.
i The error occurred in row 3.
Caused by error in `seq.default()`:
! wrong sign in 'by' argument
Any ideas please;
Best Regards
This error occurs because case_when() evaluates all the right-hand-side expressions and then selects among them based on the left-hand side. Therefore, even though row 4 of your sample dataset, for example, will default to TRUE ~ 0, the RHS of the first two conditions is also evaluated. In this case, the first condition's RHS is seq((13-15)/2,13-15,1), which returns an error because from = -1 and to = -2, so the by argument cannot be 1 (it has the wrong sign).
seq((13-15)/2, 13-15, 1)
Error in seq.default((13 - 15)/2, 13 - 15, 1) :
wrong sign in 'by' argument
You could do something like this:
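f <- function(i,a,e) {
# 0 when iq is above 4 or Age is below 15
if(i>4 || a<15) return(0)
# iq <= 2: draw between (Age-15)/2 and Age-15
if(i<=2) return(sample(seq((a-15)/2, a-15),1))
# otherwise (iq above 2 and at most 4): draw between (Age-Educ_y-6)/2 and Age-Educ_y-6
return(sample(seq((a-e-6)/2, a-e-6),1))
}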
Df %>% rowwise() %>% mutate(Exep_y=f(iq,Age,Educ_y))
Output:
iq Age Educ_y Exep_y
<int> <int> <int> <dbl>
1 5 23 15 0
2 4 54 17 16.5
3 2 43 6 21
4 3 13 7 0
5 5 14 8 0
6 1 51 16 27
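The same idea can also be sketched without dplyr, using mapply() to apply a scalar helper row by row. The helper draw_exp below is a hypothetical variant written for this illustration, with two assumptions that are not in the question: it returns 0 when the upper bound is not positive (for example when Age - Educ_y - 6 is negative), and it picks a candidate by index so that a single candidate is never misread by sample() as "sample from 1 to x".

```r
# Sample data from the question
Df <- data.frame(
  iq = c(5, 4, 2, 3, 5, 1),
  Age = c(23, 54, 43, 13, 14, 51),
  Educ_y = c(15, 17, 6, 7, 8, 16)
)

# Hypothetical helper: one draw of years of experience for a single row
draw_exp <- function(iq, age, educ) {
  if (age < 15 || iq > 4) return(0)
  hi <- if (iq <= 2) age - 15 else age - educ - 6
  if (hi <= 0) return(0)          # assumption: no experience if the bound is not positive
  candidates <- seq(hi / 2, hi, by = 1)
  # index-based draw, safe even when there is only one candidate
  candidates[sample.int(length(candidates), 1)]
}

set.seed(1)
Df$Exep_y <- mapply(draw_exp, Df$iq, Df$Age, Df$Educ_y)
Df
```

Because the guard clauses run before seq() is called, the sequence is only ever built for rows that actually need it.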
You could try using if_else() rather than case_when:
Documentation can be found here: https://dplyr.tidyverse.org/reference/if_else.html
I am trying to fit a Lasso regression model using glmnet(). As I have never worked with Lasso regression before, I followed some tutorials, but applying the model always results in the following error:
Error in lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs,:
one multinomial or binomial class has 1 or 0 observations; not allowed
Working with the dataset from this question (https://stats.stackexchange.com/questions/72251/an-example-lasso-regression-using-glmnet-for-binary-outcome) it seems that the dependent variable, the y, has to consist only of 0 and 1. Whenever I set one of the observation values of y to 2 or anything else than 0 or 1, it results in this error.
This is my code:
lambdas_to_try <- 10^seq(-3, 5, length.out = 100)
x_vars <- as.matrix(data.frame(data$x1, data$x2, data$x3))
lasso_cv <- cv.glmnet(x_vars, y=as.factor(data$y), alpha = 1, lambda = lambdas_to_try, family = "binomial", nfolds = 10)
x_vars_2 <- model.matrix(data$y ~ data$x1 + data$x2 + data$x3)[, -1]
lasso_cv_2 <- cv.glmnet(x_vars_2, y=as.factor(data$y), alpha = 1, lambda = lambdas_to_try, family = "binomial", nfolds = 10)
And this is what my dataset looks like:
The problem is that in my data the y variable represents the number of crimes, so it takes integer values between 0 and 1000. I cannot restrict the values to 0 and 1. How can I apply a Lasso regression to these data?
As @Gregor noted, what you have is count data, and it should be regression and not classification. Using an example dataset, this is how you can implement it:
library(MASS)
library(glmnet)
data(Insurance)
Your response variable should be numeric:
str(Insurance)
'data.frame': 64 obs. of 5 variables:
$ District: Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
$ Group : Ord.factor w/ 4 levels "<1l"<"1-1.5l"<..: 1 1 1 1 2 2 2 2 3 3 ...
$ Age : Ord.factor w/ 4 levels "<25"<"25-29"<..: 1 2 3 4 1 2 3 4 1 2 ...
$ Holders : int 197 264 246 1680 284 536 696 3582 133 286 ...
$ Claims : int 38 35 20 156 63 84 89 400 19 52 ...
Now we set the predictors and response variables:
y = Insurance$Claims
X = model.matrix(Claims ~ .,data=Insurance)
Run a cv to find the best lambda (if you don't know your L1 norm):
fit = cv.glmnet(x=X,y=y,family="poisson")
pred = predict(fit,X,s=fit$lambda.1se)
The prediction is on the log scale, so to compare it with your actual values, plot it against log(y):
plot(log(y),pred,xlab="log (actual)",ylab="log (predicted)")
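To make the log-scale point concrete without depending on glmnet, here is a minimal sketch that swaps in an ordinary, unpenalized Poisson glm() on the same Insurance data. It is not the Lasso fit; it only illustrates that link-scale predictions are logs of the expected counts, so exponentiating them gives predictions on the count scale. For glmnet fits, predict() also accepts type = "response", which should do the same back-transformation.

```r
library(MASS)   # provides the Insurance data
data(Insurance)

# Plain Poisson regression (no penalty), used here only to illustrate the scales
fit_glm <- glm(Claims ~ District + Group + Age, family = poisson, data = Insurance)

pred_link <- predict(fit_glm, type = "link")      # log of the expected count
pred_resp <- predict(fit_glm, type = "response")  # expected count

# Exponentiating the link-scale prediction recovers the count-scale prediction
same_scale <- isTRUE(all.equal(unname(exp(pred_link)), unname(pred_resp)))
same_scale
```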
So I have a data set that is based on HR data training which asks tech and common questions.
The rows represent an employee and the columns represent the score they got on each question. The columns also include demographic data. I only want to see the row total of the tech and common questions though and not include the demographic data.
techs<-grep("^T",rownames(dat))
commons<-grep("^C",rownames(dat))
I used this to try to group the columns together but when I do:
total<-rowsum(commons,techs)
and try to put it in a linear regression:
Mod1Train<-lm(total~.,data=dat[Train,])
it says that there are different variable lengths.
I'm a super newbie to R, so sorry in advance if I'm really off.
In the future, it would be very helpful if you provided a sample of your data. It's hard for us to help when we're guessing about that. Please see this link: https://stackoverflow.com/help/minimal-reproducible-example.
Having said that LOL and realizing you're new I'll take a guess...
Let's make pretend data that I imagine is a smaller imaginary version of yours...
set.seed(2020)
emplid <- 1:10
gender <- sample(c("Male", "Female"), size = 10, replace = TRUE)
Tech1 <- sample(10:20, size = 10, replace = TRUE)
Tech2 <- sample(10:20, size = 10, replace = TRUE)
Tech3 <- sample(10:20, size = 10, replace = TRUE)
Common1 <- sample(10:20, size = 10, replace = TRUE)
Common2 <- sample(10:20, size = 10, replace = TRUE)
Common3 <- sample(10:20, size = 10, replace = TRUE)
Kathryn <- data.frame(emplid, gender, Tech1, Tech2, Tech3, Common1, Common2, Common3)
Kathryn
#> emplid gender Tech1 Tech2 Tech3 Common1 Common2 Common3
#> 1 1 Female 10 17 15 18 17 15
#> 2 2 Female 17 13 11 20 11 13
#> 3 3 Male 17 11 19 18 10 12
#> 4 4 Female 19 16 15 14 15 16
#> 5 5 Female 11 13 20 20 16 13
#> 6 6 Male 15 11 17 19 17 13
#> 7 7 Male 11 13 11 15 14 11
#> 8 8 Female 12 14 10 11 17 19
#> 9 9 Female 11 13 15 18 11 10
#> 10 10 Female 17 20 12 12 14 15
If you're new, you may want to invest some time learning the tidyverse, which can make this simple; see the question "Efficiently sum across multiple columns in R".
Per your note in the comments, you have a pattern we can match for summing questions. You were close with your attempt at grep but we want the values back so we need value = TRUE which we'll store and make use of.
techqs <- grep(x = names(Kathryn), pattern = "^Tech", value = TRUE)
commonqs <- grep(x = names(Kathryn), pattern = "^Common", value = TRUE)
Kathryn$TechScores <- rowSums(Kathryn[,techqs])
Kathryn$CommonScores <- rowSums(Kathryn[,commonqs])
### Commented out how to do it manually.
# Kathryn$TechScores <- rowSums(Kathryn[,c("Tech1", "Tech2", "Tech3")])
# Kathryn$CommonScores <- rowSums(Kathryn[,c("Common1", "Common2", "Common3")])
Kathryn$TotalScore <- Kathryn$TechScores + Kathryn$CommonScores
Now to regress, which is where the statistical problem comes in. Are you really trying to predict the total score from its components? That's not hard in R, but it leads to silly answers.
Kathryn_model <- lm(formula = TotalScore ~ TechScores + CommonScores, data = Kathryn)
summary(Kathryn_model)
#> Warning in summary.lm(Kathryn_model): essentially perfect fit: summary may be
#> unreliable
#>
#> Call:
#> lm(formula = TotalScore ~ TechScores + CommonScores, data = Kathryn)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -3.165e-14 -1.905e-15 9.290e-16 8.590e-15 1.183e-14
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 8.089e-14 6.345e-14 1.275e+00 0.243
#> TechScores 1.000e+00 9.344e-16 1.070e+15 <2e-16 ***
#> CommonScores 1.000e+00 1.130e-15 8.853e+14 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 1.43e-14 on 7 degrees of freedom
#> Multiple R-squared: 1, Adjusted R-squared: 1
#> F-statistic: 9.875e+29 on 2 and 7 DF, p-value: < 2.2e-16
I don't understand your code or what you are looking for.
rowsum does not compute a row total; on the contrary, it adds rows together within groups. It returns a matrix, not a vector. Is that what you want?
Otherwise, maybe you're looking for rowSums, which computes the total of each row of a matrix.
(by the way, if you need it, the matrix product is %*% in R)
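The difference between the two functions is easy to see on a tiny matrix. The sketch below uses made-up scores and a made-up grouping vector purely for illustration.

```r
# Three employees (rows), two questions (columns); values are invented
m <- matrix(1:6, nrow = 3, dimnames = list(NULL, c("Tech1", "Tech2")))

# rowSums: one total per row, returned as a vector
row_totals <- rowSums(m)
row_totals

# rowsum: adds rows together within each group, returned as a matrix
# (rows 1 and 2 belong to group "a", row 3 to group "b")
group_totals <- rowsum(m, group = c("a", "a", "b"))
group_totals
```

rowSums gives three totals (one per employee), while rowsum collapses the three rows into two, one per group.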
Are you sure you have understood lm?
A call to lm should look something like
lm(y~x,data=adataframe)
"adataframe" is the optional data frame in which lm looks for both the response and the input variable, named "y" and "x" here. If they are not found there, they are looked up in the environment of the formula (typically the global environment). It is usually better, however, to keep them in a data frame to avoid common errors.
So if you want to use lm, maybe you should first obtain two vectors, one for x and one for y, put them in a data.frame with two columns (x and y), and call the code above, if I have understood correctly.
Note: if you want to remove the intercept, use
lm(y~x+0,data=adataframe)
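As a small illustration of the points above, the sketch below fits the same model three ways on invented data: with both variables in the data frame, with the response taken from the calling environment instead, and without an intercept. All object names are made up for the example.

```r
set.seed(1)
adataframe <- data.frame(x = 1:10)
adataframe$y <- 1 + 2 * adataframe$x + rnorm(10, sd = 0.1)

# Both y and x are found in the data frame
fit <- lm(y ~ x, data = adataframe)

# y_outside is not a column of adataframe, so lm looks it up
# in the environment where the formula was created
y_outside <- adataframe$y
fit_env <- lm(y_outside ~ x, data = adataframe)

# Same model without the intercept
fit_noint <- lm(y ~ x + 0, data = adataframe)

coef(fit)
coef(fit_env)
coef(fit_noint)
```

The first two fits give the same coefficients; the third has a slope only.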
I am trying to get a simple plot showing the time course of worry duration over 6 days for two groups. However, I get vertical lines instead of a line showing the time course.
This is what my data looks like:
> head(alldays_dur)
ParticipantID Session Day Time Worry_duration group
1 1 2 1 71804 15 intervention
2 1 4 1 56095 5 intervention
3 2 2 1 36739 15 intervention
4 2 4 1 45013 10 intervention
5 2 5 1 51026 5 intervention
This is the structure of my data
> str(alldays_dur)
'data.frame': 2620 obs. of 10 variables:
$ ParticipantID : num 113 113 113 113 113 113 113 113 113 113 ...
$ Session : num 9 10 11 12 14 15 16 21 22 24 ...
$ Day : Factor w/ 6 levels "1","2","3","4",..: 2 2 2 2 2 2 2 3 3
$ Time : num 37350 42862 47952 51555 61499 ...
$ Worry_duration: num 5 5 5 5 10 0 5 5 5 5 ...
$ group : Factor w/ 2 levels "Intervention group",..: 1 1 1 1 1 1
I have tried the following code:
p <- ggplot(alldays_dur, aes(x=Day, y=Worry_duration, group=1)) +
geom_line() +
labs(x = "Day",
y = "Mean worry duration in minutes per day")
print(p)
However, the resulting plot shows vertical lines at each day rather than a time course.
I have included the group=1 in the code after reading some earlier posts on this topic. However, it didn't help me as I had hoped.
Do you maybe have some useful tips for me? Thank you in advance.
Ps. I am sorry if the post is unclear in any way, this is my first time ever posting on stackoverflow, so I am not quite familiar with all the 'post-options' yet.
You need to summarize your data first, with ddply for example:
library(plyr) # ddply
library(ggplot2) # ggplot
# Creating dataset
raw_data = data.frame(Day = sample(c(1:6),100, replace = TRUE),
group = sample(c("group_1", "group_2"),100, replace = TRUE),
Worry_duration = sample(seq(0,30,5), 100, replace = TRUE))
# Summarize
DF = ddply(raw_data, c("Day", "group"), summarize,
Worry_duration.mean = mean(Worry_duration, na.rm = TRUE))
# Plot
ggplot(DF, aes(x = Day, y = Worry_duration.mean, group = group, color = group)) +
geom_line()+ xlab("Day") + ylab("Mean worry duration in minutes per day")
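The vertical lines appear because each day has many observations, so geom_line() connects all the values that share the same x position. If you would rather not load plyr, the summarising step can also be sketched with base R's aggregate(); the data below are simulated in the same way as above and are not the real study data.

```r
# Simulated data: several observations per day and group
set.seed(42)
raw_data <- data.frame(
  Day = sample(1:6, 100, replace = TRUE),
  group = sample(c("group_1", "group_2"), 100, replace = TRUE),
  Worry_duration = sample(seq(0, 30, 5), 100, replace = TRUE)
)

# One mean per Day x group combination, so each line has a single y value per day
DF <- aggregate(Worry_duration ~ Day + group, data = raw_data, FUN = mean)
head(DF)
```

The resulting DF has one row per day and group, and its Worry_duration column (now a mean) can be passed to ggplot as the y aesthetic.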
I have a data set with 20 variables. 10 of them are variables of great interest but these variables need to be adjusted for group differences in terms of age and sex. I do this by using regression, to predict values depending on age and sex.
There are many variables, and many persons, so I want a loop or similar.
Here is an example of what I'm attempting
# Load example data
library(survival)
library(dplyr)
data(lung) # example data
# I want to obtain adjusted values for the following two variables, called "dependents"
dependents <- names(select(lung, 7:8))
new_data <- lung # copies data set
for (i in seq_along(dependents)) {
eq <- paste(dependents[i],"~ age + sex")
fit <- lm(as.formula(eq), data= new_data)
new_data$predicted_value <- predict(fit, newdata=new_data, type='response')
new_data <- rename(new_data, paste(dependents[i], "_predicted", sep="") = predicted_value)
}
View(new_data)
This failed to provide me with the "dependents" in adjusted (i.e predicted) form.
Any ideas?
Thanks in advance
Here is an alternative approach, using the tidyr package and the augment function from my broom package:
library(tidyr)
library(broom)
new_data <- lung %>%
gather(dependent, value, ph.karno:pat.karno) %>%
group_by(dependent) %>%
do(augment(lm(value ~ age + sex, data = .)))
This reorganizes the data so that each dependent (ph.karno and pat.karno) is stacked on top of each other, distinguished by a dependent column. The augment function turns each model into a data frame with columns for fitted values, residuals, and other values you care about (see ?lm_tidiers for more). The .fitted column then gives the fitted values:
new_data
#> Source: local data frame [452 x 12]
#> Groups: dependent
#>
#> dependent .rownames value age sex .fitted .se.fit .resid
#> 1 ph.karno 1 90 74 1 78.86709 1.406553 11.132915
#> 2 ph.karno 2 90 68 1 80.53347 1.115994 9.466530
#> 3 ph.karno 3 90 56 1 83.86624 1.226463 6.133759
#> 4 ph.karno 4 90 57 1 83.58851 1.181024 6.411490
#> 5 ph.karno 5 100 60 1 82.75532 1.078170 17.244683
#> 6 ph.karno 6 50 74 1 78.86709 1.406553 -28.867085
#> 7 ph.karno 7 70 68 2 80.18860 1.419744 -10.188596
#> 8 ph.karno 8 60 71 2 79.35540 1.555365 -19.355404
#> 9 ph.karno 9 70 53 1 84.69943 1.388600 -14.699433
#> 10 ph.karno 10 70 61 1 82.47759 1.056850 -12.477586
#> .. ... ... ... ... ... ... ... ...
#> Variables not shown: .hat (dbl), .sigma (dbl), .cooksd (dbl), .std.resid
#> (dbl)
As one way you could use this data, you could graph how the predictions for the dependent variables differ:
library(ggplot2)
ggplot(new_data, aes(age, .fitted, color = dependent, lty = factor(sex))) +
geom_line()
If you're looking to control for the age and sex, however, you probably want to work with the .resid column.
Can't you just do this?
dependents <- names(lung)[7:8]
fit <- lm(as.formula(sprintf("cbind(%s) ~ age + sex",
paste(dependents, collapse = ", "))),
data = lung)
predict(fit)
Maybe I'm misunderstanding. Your question isn't very clear.
And a third approach.
new_data <- na.omit(lung[,c("sex","age",dependents)])
result <- lapply(new_data[,dependents],
function(y)predict(lm(y~age+sex,data.frame(y=y,new_data[,c("age","sex")]))))
names(result) <- paste(names(result),"predicted",sep="_")
result <- cbind(new_data,as.data.frame(result))
head(result)
# sex age ph.karno pat.karno ph.karno_predicted pat.karno_predicted
# 1 1 74 90 100 78.83030 77.34670
# 2 1 68 90 90 80.59974 78.53841
# 3 1 56 90 90 84.13862 80.92183
# 4 1 57 90 60 83.84371 80.72321
# 5 1 60 100 90 82.95899 80.12736
# 6 1 74 50 80 78.83030 77.34670
Your original code has a couple of subtle problems (other than the fact that it doesn't run). The response variables have a few NAs, which are removed automatically by lm(...), so the prediction has fewer rows than the original data set, and when you try to add the new column with, e.g.
new_data$predicted_value <- predict(fit, newdata=new_data, type='response')
you get an error. You have to remove the NAs from new_data first, as shown in the code above.
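An alternative that keeps every row, sketched here as one option rather than the only fix, is to fit with na.action = na.exclude. With that setting fitted() pads the result with NA for the rows lm() dropped, so it has the same length as the data and can be assigned as a column directly. The loop below also builds each new column name with paste0() instead of rename().

```r
library(survival)   # provides the lung data

dependents <- c("ph.karno", "pat.karno")
new_data <- lung

for (dep in dependents) {
  # Build e.g. ph.karno ~ age + sex from the column name
  fit <- lm(reformulate(c("age", "sex"), response = dep),
            data = new_data, na.action = na.exclude)
  # fitted() returns NA where the response was missing, so lengths match
  new_data[[paste0(dep, "_predicted")]] <- fitted(fit)
}

head(new_data[, c("age", "sex", "ph.karno", "ph.karno_predicted",
                  "pat.karno", "pat.karno_predicted")])
```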
I'm also wondering, since your data seems to be counts of something, whether you should be using a Poisson glm instead of lm?