I have a data frame like this:
diagnosis A B C D
1 yes 1 1 0 1
2 no 0 1 0 1
3 yes 0 1 0 1
4 yes 1 1 1 1
5 yes NA 1 NA 0
6 no 1 NA 0 1
7 yes 1 0 0 0
8 no 0 0 1 1
9 no 0 1 1 NA
10 no 1 0 1 1
A, B, C, and D refer to the questions in my test and the number "1" means the participant got it right and "0" means the participant's answer is wrong.
What I want is to perform multiple two sample t-tests for each question and the total score for the test.
And these are the steps I took so far:
#calculate sum score per participant
mydf <- cbind(mydf, Total = rowSums(mydf[,2:5]))
#Reshape the tibble from wide to long format
mydf <- mydf %>%
pivot_longer(!diagnosis, names_to = "Questions", values_to = "Score")
#summary of my data
Sumdf <- mydf %>% group_by(Questions, diagnosis) %>% get_summary_stats(Score, type = "mean_sd")
Sumdf
A tibble: 10 x 6
diagnosis Questions variable n mean sd
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 no A Score 5 0.4 0.548
2 yes A Score 4 0.75 0.5
3 no B Score 4 0.5 0.577
4 yes B Score 5 0.8 0.447
5 no C Score 5 0.6 0.548
6 yes C Score 4 0.25 0.5
7 no D Score 4 1 0
8 yes D Score 5 0.6 0.548
9 no Total Score 3 2.33 0.577
10 yes Total Score 4 2.5 1.29
From here, how can I run a t-test that compares the means of the two diagnosis groups for each question and for the total score?
I actually found something like this on the internet:
#Run T-test
ttest <- mydf %>%
group_by(Questions) %>%
t_test(Score ~ diagnosis) %>%
adjust_pvalue(method = "BH") %>%
add_significance()
And this is what I got:
But as you can see, the n values here are wrong (because I had NAs), and I don't understand why the adjusted p-values are identical across the questions. I read that adjusted p-values are preferable when running multiple t-tests, but I am not sure about that. I also want to include the means and SDs in my table (I plan to knit this script to PDF with papaja).
So, is there another way to run multiple t-tests, or does what I found look trustworthy, and should I rely on the adjusted p-values as the code suggests?
Thank you so much!
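On the identical adjusted p-values: with the Benjamini-Hochberg ("BH") method, ties like this are expected and are not by themselves a sign of an error. BH scales each p-value by n/rank and then takes a running minimum from the largest p-value downwards, so neighbouring tests can end up with the same adjusted value. Here is a minimal sketch with base R's p.adjust(), which as far as I know is what adjust_pvalue() relies on. The p-values are made up for illustration and are not taken from the data above.

```r
# Made-up raw p-values, for illustration only (not from the question's data)
p_raw <- c(0.01, 0.04, 0.03, 0.50)

# Benjamini-Hochberg adjustment with base R
p_bh <- p.adjust(p_raw, method = "BH")

# The second and third adjusted values coincide: BH takes a running minimum
# over the rank-scaled p-values, so 0.03 inherits the value computed for 0.04
print(data.frame(p_raw, p_bh))
```

With only a handful of tests and p-values that lie close together, several or even all adjusted values can collapse to the same number.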
I have a dataset containing 4 organisation units (org_unit) with different numbers of participants and 2 questions (Q1, Q2) on a two-point scale (1:2). I want to know how many people per unit answered the respective question with [1] and divide that count by the total number of participants per unit.
Org_unit <- c(1,1,1,1,2,2,2,3,3,4)
Q1 <- c(1,2,1,2,1,2,1,2,1,2)
Q2 <- c(-9,-9,-9,-9,-9,-9,-9,-9,-9,-9)
The problem is, my Q2 only consists of [-9] which stands for non-response. I therefore assigned NA to [-9].
DF <- data.frame(Org_unit, Q1, Q2)
DF[DF == -9] <- NA
DF
Org_unit Q1 Q2
1 1 1 NA
2 1 2 NA
3 1 1 NA
4 1 2 NA
5 2 1 NA
6 2 2 NA
7 2 1 NA
8 3 2 NA
9 3 1 NA
10 4 2 NA
Next I calculated the proportion of people who answered Q1 with [1], which works fine.
prop_q1 <- DF %>%
group_by(Org_unit) %>%
summarise(count = n(),
prop = mean(Q1 == 1))
prop_q1
# A tibble: 4 x 3
Org_unit count prop
<dbl> <int> <dbl>
1 1 4 0.5
2 2 3 0.667
3 3 2 0.5
4 4 1 0
When I run the same code for Q2, however, I get the same number of members per unit (count = c(4,3,2,1)), although nobody answered the question. I don't want them to be counted as participants, since they technically didn't participate in the study.
prop_q2 <- DF %>%
group_by(Org_unit) %>%
summarise(count = n(),
prop = mean(Q2 == 1))
prop_q2
# A tibble: 4 x 3
Org_unit count prop
<dbl> <int> <dbl>
1 1 4 NA
2 2 3 NA
3 3 2 NA
4 4 1 NA
Is there a way to count the right number of members per unit when the data contain NAs (the [-9] codes)?
Thanks!
Would
prop_q2 <- DF %>%
filter(!is.na(Q2)) %>%
group_by(Org_unit) %>%
summarise(count = n(),
prop = mean(Q2 == 1))
do the job?
Given that you want to do this across multiple columns, I think that using across() within the dplyr verbs will be better for you. I explain the solution below.
library(dplyr)
Org_unit <- c(1,1,1,1,2,2,2,3,3,4)
Q1 <- c(1,2,1,2,1,2,1,2,1,2)
Q2 <- c(1,-9,-9,-9,-9,-9,-9,-9,-9,-9) #Note one response
df <- tibble(Org_unit, Q1, Q2)
df %>%
mutate(across(starts_with("Q"), ~na_if(., -9))) %>%
group_by(Org_unit) %>%
summarize(across(starts_with("Q"),
list(
N = ~sum(!is.na(.)),
prop = ~sum(. == 1, na.rm = TRUE)/sum(!is.na(.)))
))
# A tibble: 4 x 5
Org_unit Q1_N Q1_prop Q2_N Q2_prop
* <dbl> <int> <dbl> <int> <dbl>
1 1 4 0.5 1 1
2 2 3 0.667 0 NaN
3 3 2 0.5 0 NaN
4 4 1 0 0 NaN
First, we take the data frame (which I created as a tibble) and replace every -9 with NA in all columns whose names start with a capital "Q".
Second, we group by organizational unit and summarize each question column with two functions. The first counts the non-missing responses; the suffix _N is appended to the names of the resulting columns. The second calculates the proportion of 1s among the non-missing responses; those columns get the suffix _prop.
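The two summary functions can be checked on a plain vector. This is only a sketch on a made-up vector q (not part of the data above); it shows why mean(q == 1) returns NA and how counting the non-missing values avoids that.

```r
# Hypothetical answers for one unit: 1 and 2 are responses, NA is a non-response
q <- c(1, 2, NA, 1, NA)

n_total     <- length(q)        # counts everyone, like n()
n_responded <- sum(!is.na(q))   # counts only actual responses
prop_naive  <- mean(q == 1)     # NA, because NA == 1 evaluates to NA
prop_valid  <- sum(q == 1, na.rm = TRUE) / n_responded

print(c(n_total = n_total, n_responded = n_responded,
        prop_naive = prop_naive, prop_valid = prop_valid))
```

When a unit has no responses at all, n_responded is 0 and the division gives 0/0, which is why Q2_prop shows NaN for units 2 to 4.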
I have a dataset in R that contains species abundance data ordered by station and replicate sample. So, column one contains the station number, column two contains the replicate number, column three contains the species name, and column four contains the species abundance.
I want to add a new fifth column that contains the reverse rank abundance of a species per station/replicate combination (i.e., If there are four species in a station/replicate, I want the species with the lowest abundance to be given a value of 1, and the species with the highest abundance to be given a value of 4).
Here a sample code of the type of dataset I am working with:
library(tidyverse)
dat <- as.data.frame(matrix(c(1,1,"A",2.34,
1,1,"B",4.32,
1,1,"C",2.46,
1,1,"D",6.32,
1,2,"A",3.54,
1,2,"B",7.67,
1,2,"D",3.45,
2,1,"D",4.67,
2,1,"E",6.54,
2,1,"G",5.67,
2,2,"B",2.31,
2,2,"G",1.12), ncol = 4, nrow = 12, byrow = TRUE
))
names(dat)[1] <- "station"
names(dat)[2] <- "replicate"
names(dat)[3] <- "taxa"
names(dat)[4] <- "abundance"
dat <- dat %>%
  mutate(abundance = parse_number(abundance))
station replicate taxa abundance
1 1 A 2.34
1 1 B 4.32
1 1 C 2.46
1 1 D 6.32
1 2 A 3.54
1 2 B 7.67
1 2 D 3.45
2 1 D 4.67
2 1 E 6.54
2 1 G 5.67
2 2 B 2.31
2 2 G 1.12
And here is some code to reorder the dataset so that it goes from the species with the lowest abundance to the species with the highest abundance per station/replicate:
dat %>%
  arrange(station, replicate, abundance)
For some reason, I am unsure how to continue from here. Any help would be greatly appreciated!
If you want to rank by station/replicate, you first group_by this combination, and then create a new column with the rank value.
library(tidyverse)
dat %>%
group_by(station, replicate) %>%
mutate(abundance = as.numeric(abundance),
rank = rank(abundance))
Output
station replicate taxa abundance rank
<chr> <chr> <chr> <dbl> <dbl>
1 1 1 A 2.34 1
2 1 1 B 4.32 3
3 1 1 C 2.46 2
4 1 1 D 6.32 4
5 1 2 A 3.54 2
6 1 2 B 7.67 3
7 1 2 D 3.45 1
8 2 1 D 4.67 1
9 2 1 E 6.54 3
10 2 1 G 5.67 2
11 2 2 B 2.31 2
12 2 2 G 1.12 1
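One thing to keep in mind is how rank() treats ties. The sample data has none, but real abundance data might. A small sketch on a made-up vector (illustrative values only) shows the default behaviour and two alternatives:

```r
# Made-up abundances containing a tie, for illustration only
abundance <- c(2.5, 1.0, 2.5, 4.0)

r_avg  <- rank(abundance)                        # default ties.method = "average"
r_min  <- rank(abundance, ties.method = "min")   # tied values share the lower rank
r_desc <- rank(-abundance, ties.method = "min")  # highest abundance gets rank 1

print(rbind(r_avg, r_min, r_desc))
```

The default matches what was asked for (lowest abundance gets 1); ranking the negated values flips the direction if that is ever needed.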
I have a data frame with three variables and some missing values in one of the variables that looks like this:
subject <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2)
part <- c(0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3,0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3)
sad <- c(1,7,7,4,NA,NA,2,2,NA,2,3,NA,NA,2,2,1,NA,5,NA,6,6,NA,NA,3,3,NA,NA,5,3,NA,7,2)
df1 <- data.frame(subject,part,sad)
I have created a new data frame with the mean values of 'sad' per subject and part using a loop, like this:
columns<-c("sad.m",
"part",
"subject")
df2<-matrix(data=NA,nrow=1,ncol=length(columns))
df2<-data.frame(df2)
names(df2)<-columns
tn<-unique(df1$subject)
row=1
for (s in tn){
for (i in 0:3){
TN<-df1[df1$subject==s&df1$part==i,]
df2[row,"sad.m"]<-mean(as.numeric(TN$sad), na.rm = TRUE)
df2[row,"part"]<-i
df2[row,"subject"]<-s
row=row+1
}
}
Now I want to include an additional variable 'missing' that indicates the percentage of rows with missing values per subject and part, so that I get df3:
subject <- c(1,1,1,1,2,2,2,2)
part<-c(0,1,2,3,0,1,2,3)
sad.m<-df2$sad.m
missing <- c(0,50,50,25,50,50,50,25)
df3 <- data.frame(subject,part,sad.m,missing)
I'd really appreciate any help on how to go about this!
It's best to avoid loops in R where possible, since they can get messy and tend to be quite slow. For this sort of task, the dplyr package is a good fit and well worth learning. It can save you a lot of time.
You can create a data frame with both variables by first grouping by subject and part, and then performing a summary of the grouped data frame:
library(dplyr)
df2 <- df1 %>%
  group_by(subject, part) %>%
  summarise(
    sad_mean = mean(sad, na.rm = TRUE),
    # percentage of rows in the group where sad is NA
    na_count = sum(is.na(sad)) / n() * 100
  )
df2
# A tibble: 8 x 4
# Groups: subject [2]
subject part sad_mean na_count
<dbl> <dbl> <dbl> <dbl>
1 1 0 4.75 0
2 1 1 2 50
3 1 2 2.5 50
4 1 3 1.67 25
5 2 0 5.5 50
6 2 1 4.5 50
7 2 2 4 50
8 2 3 4 25
For each subject and part, you can calculate the mean of sad and the percentage of NA values using is.na and mean.
library(dplyr)
df1 %>%
group_by(subject, part) %>%
summarise(sad.m = mean(sad, na.rm = TRUE),
perc_missing = mean(is.na(sad)) * 100)
# subject part sad.m perc_missing
# <dbl> <dbl> <dbl> <dbl>
#1 1 0 4.75 0
#2 1 1 2 50
#3 1 2 2.5 50
#4 1 3 1.67 25
#5 2 0 5.5 50
#6 2 1 4.5 50
#7 2 2 4 50
#8 2 3 4 25
Same logic with data.table:
library(data.table)
setDT(df1)[, .(sad.m = mean(sad, na.rm = TRUE),
perc_missing = mean(is.na(sad)) * 100), .(subject, part)]
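As a cross-check that needs no extra packages, the same percentages can be computed with base R's tapply(). This is just a sketch that rebuilds df1 from the question and returns a subject-by-part matrix instead of a long data frame:

```r
# Rebuild the example data from the question
subject <- rep(c(1, 2), each = 16)
part    <- rep(rep(0:3, each = 4), times = 2)
sad     <- c(1,7,7,4,NA,NA,2,2,NA,2,3,NA,NA,2,2,1,
             NA,5,NA,6,6,NA,NA,3,3,NA,NA,5,3,NA,7,2)
df1 <- data.frame(subject, part, sad)

# mean() of a logical vector is the share of TRUE values,
# so mean(is.na(x)) * 100 is the percentage of missing rows
perc_missing <- tapply(df1$sad,
                       list(subject = df1$subject, part = df1$part),
                       function(x) mean(is.na(x)) * 100)
print(perc_missing)
```

Rows of the matrix are subjects and columns are parts, so perc_missing["1", "3"] is the value for subject 1, part 3.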
Try this dplyr approach to compute df3:
library(dplyr)
#Code
df3 <- df1 %>% group_by(subject,part) %>% summarise(N=100*length(which(is.na(sad)))/length(sad))
Output:
# A tibble: 8 x 3
# Groups: subject [2]
subject part N
<dbl> <dbl> <dbl>
1 1 0 0
2 1 1 50
3 1 2 50
4 1 3 25
5 2 0 50
6 2 1 50
7 2 2 50
8 2 3 25
To combine the result with df2, you can use left_join():
#Left join
df3 <- df1 %>% group_by(subject,part) %>%
  summarise(N=100*length(which(is.na(sad)))/length(sad)) %>%
  left_join(df2, by = c("subject", "part"))
Output:
# A tibble: 8 x 4
# Groups: subject [2]
subject part N sad.m
<dbl> <dbl> <dbl> <dbl>
1 1 0 0 4.75
2 1 1 50 2
3 1 2 50 2.5
4 1 3 25 1.67
5 2 0 50 5.5
6 2 1 50 4.5
7 2 2 50 4
8 2 3 25 4
I would like to combine group_by, ifelse and filter in my code for the example data frame below. What I would like is the following:
1) Group by x.
2) Check whether result > 1. If TRUE, check whether the month in which result > 1 equals max(month) for that group. If TRUE, keep all rows for that group.
All other rows should be discarded, i.e. both when result <= 1 and when the month where result > 1 is not max(month). So in my example data frame all rows for B should be kept and all rows for A should be discarded.
x month result
1 A 1 0.5
2 A 2 0.6
3 A 3 1.2
4 A 4 1.1
5 A 5 0.9
6 B 1 0.3
7 B 2 0.4
8 B 3 0.5
9 B 4 0.9
10 B 5 1.2
dat <- data.frame(x = c("A","A","A","A","A","B","B","B","B","B"),
month = c(1,2,3,4,5,1,2,3,4,5),
result = c(.5,.6,1.2,1.1,.9,.3,.4,.5,.9,1.2))
Using data.table
library(data.table)
setDT(dat)[, .SD[result[which.max(month)] > 1], x]
# x month result
#1: B 1 0.3
#2: B 2 0.4
#3: B 3 0.5
#4: B 4 0.9
#5: B 5 1.2
Or with dplyr
library(dplyr)
dat %>%
group_by(x) %>%
filter(result[which.max(month)] > 1)
# A tibble: 5 x 3
# Groups: x [1]
# x month result
# <fct> <dbl> <dbl>
#1 B 1 0.3
#2 B 2 0.4
#3 B 3 0.5
#4 B 4 0.9
#5 B 5 1.2
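To see what the condition inside filter() evaluates to for each group, it can help to compute it on its own first. The following base R sketch rebuilds dat from the question; keep_group and kept are names introduced here for illustration.

```r
# Rebuild the example data from the question
dat <- data.frame(x = rep(c("A", "B"), each = 5),
                  month = rep(1:5, times = 2),
                  result = c(.5, .6, 1.2, 1.1, .9, .3, .4, .5, .9, 1.2))

# For each group: is the result in the latest month above 1?
keep_group <- sapply(split(dat, dat$x),
                     function(g) g$result[which.max(g$month)] > 1)
print(keep_group)

# Keep only the rows whose group passed the check (lookup by group name)
kept <- dat[keep_group[as.character(dat$x)], ]
print(kept)
```

Note that which.max() returns only the first position of the maximum, so this assumes each group has a single row for its latest month.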
If you want to stay in the tidyverse and not venture into base selection, we can easily get there as well by using any() to check whether any row in the group meets your criteria:
dat %>%
group_by(x) %>%
filter(any(result > 1 & month == max(month)))
# A tibble: 5 x 3
# Groups: x [1]
x month result
<fct> <dbl> <dbl>
1 B 1 0.3
2 B 2 0.4
3 B 3 0.5
4 B 4 0.9
5 B 5 1.2
Alternatively, sometimes I'll create a "keep" variable to check if I've got the right ones, initially, or to make the code more easily readable by someone looking at my code years later:
dat %>%
group_by(x) %>%
mutate(keep = (result > 1 & month == max(month))) %>%
filter(any(keep))
Here is a solution with base R (without group_by or filter):
res <- Reduce(rbind,lapply(split(dat,dat$x), function(v) {
if (v$result[which.max(v$month)]>1) v else NULL}))
such that
> res
x month result
6 B 1 0.3
7 B 2 0.4
8 B 3 0.5
9 B 4 0.9
10 B 5 1.2