How to perform complex algebraic operation by group in R?

I have data frame mydata that looks like this:
city district mean1 mean2 var
alpha A 1 2 0.5
beta A 3 1 0.2
gamma B 1.5 1 1
zeta B 2 0 3
...
omega C 1 1 2
I would like to perform some more complex arithmetic by group. To be more specific, I would like to calculate the following:
sqrt(n(mydata))*((mean(mydata$mean1)-mean(mydata$mean2))/sqrt(mean(mydata$var)))
I tried something like this with dplyr:
result<-mydata %>%
group_by(district) %>%
sqrt(n(mydata))*((mean(mydata$mean1)-mean(mydata$mean2))/sqrt(mean(mydata$var)))
However, this did not work because dplyr does not recognize the expression as a function. Of course, one solution would be to use summarise to calculate all the means and the observation count by group, put them in a new data frame, and then perform the calculation above row by row. Is there a more efficient way of doing this?

You could use dplyr's mutate function. Inside a grouped mutate, n() returns the size of the current group, and the square root from your formula is applied to it:
library(dplyr)
df %>%
  group_by(district) %>%
  mutate(calculation = sqrt(n()) * (mean(mean1) - mean(mean2))/sqrt(mean(var)))
returns
# A tibble: 5 x 6
# Groups: district [3]
city district mean1 mean2 var calculation
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 alpha A 1 2 0.5 1.20
2 beta A 3 1 0.2 1.20
3 gamma B 1.5 1 1 1.25
4 zeta B 2 0 3 1.25
5 omega C 1 1 2 0
Attention: I'm not sure if you need the number of rows of the whole dataset or just of each group. In the first case, replace n() with nrow(df). (length(df) would return the number of columns, not rows.)
Data
df <- readr::read_table2("city district mean1 mean2 var
alpha A 1 2 0.5
beta A 3 1 0.2
gamma B 1.5 1 1
zeta B 2 0 3
omega C 1 1 2")
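If one row per district is enough, the same expression can go inside summarise() instead of mutate(). This is a minimal sketch, assuming the example data above and that the statistic should use the square root of the group size, as in the question's formula. The column names size and stat are only illustrative.

```r
library(dplyr)

# Same example data as above, built with base R so the sketch is self-contained
df <- data.frame(
  city     = c("alpha", "beta", "gamma", "zeta", "omega"),
  district = c("A", "A", "B", "B", "C"),
  mean1    = c(1, 3, 1.5, 2, 1),
  mean2    = c(2, 1, 1, 0, 1),
  var      = c(0.5, 0.2, 1, 3, 2)
)

# One row per district:
# sqrt(group size) * (difference of group means) / sqrt(mean group variance)
by_district <- df %>%
  group_by(district) %>%
  summarise(
    size = n(),
    stat = sqrt(size) * (mean(mean1) - mean(mean2)) / sqrt(mean(var))
  )

by_district
```

This avoids the intermediate data frame mentioned in the question: the group means and the group size are computed and combined in a single summarise() call.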

Related

Multiple T-Tests in one go in R

I have a data frame like this:
diagnosis A B C D
1 yes 1 1 0 1
2 no 0 1 0 1
3 yes 0 1 0 1
4 yes 1 1 1 1
5 yes NA 1 NA 0
6 no 1 NA 0 1
7 yes 1 0 0 0
8 no 0 0 1 1
9 no 0 1 1 NA
10 no 1 0 1 1
A, B, C, and D refer to the questions in my test and the number "1" means the participant got it right and "0" means the participant's answer is wrong.
What I want is to perform multiple two sample t-tests for each question and the total score for the test.
And these are the steps I took so far:
#calculate sum score per participant
mydf <- cbind(mydf, Total = rowSums(mydf[,2:5]))
#Reshape the tibble from wide to long format
mydf <- mydf %>%
pivot_longer(!diagnosis, names_to = "Questions", values_to = "Score")
#summary of my data
Sumdf <- mydf %>% group_by(Questions, diagnosis) %>% get_summary_stats(Score, type = "mean_sd")
Sumdf
A tibble: 10 x 6
diagnosis Questions variable n mean sd
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 no A Score 5 0.4 0.548
2 yes A Score 4 0.75 0.5
3 no B Score 4 0.5 0.577
4 yes B Score 5 0.8 0.447
5 no C Score 5 0.6 0.548
6 yes C Score 4 0.25 0.5
7 no D Score 4 1 0
8 yes D Score 5 0.6 0.548
9 no Total Score 3 2.33 0.577
10 yes Total Score 4 2.5 1.29
After this point how can I compare as a t-test those means for each question and the total score across diagnoses?
I actually found something on internet like this:
#Run T-test
ttest <- mydf %>%
group_by(Questions) %>%
t_test(Score ~ diagnosis) %>%
adjust_pvalue(method = "BH") %>%
add_significance()
And this is what I got:
But as you can see, the n values here are not correct (because I had NAs), and I don't know why or how the adjusted p-values are the same for the questions. I read that when running multiple t-tests it is better to use adjusted p-values, but I am not sure about it. Also, I want to include the means and SDs in my table too (I actually plan to knit this script to PDF with papaja).
So, is there another way to run multiple t-tests, or do you think what I found looks trustworthy and, as the code suggests, I should rely on the adjusted p-values?
Thank you so much!
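One possible way to get NA-aware group sizes, means, SDs and adjusted p-values in a single table is sketched below in base R, rather than with the grouped t_test pipeline shown above. It assumes Welch two-sample t-tests are wanted; the helper name one_test is hypothetical. Identical adjusted p-values are not necessarily a sign of an error: the "BH" method takes a running minimum over the scaled p-values, so several tests can end up with the same adjusted value.

```r
# Example data from the question (NA = missing answer)
mydf <- data.frame(
  diagnosis = c("yes", "no", "yes", "yes", "yes", "no", "yes", "no", "no", "no"),
  A = c(1, 0, 0, 1, NA, 1, 1, 0, 0, 1),
  B = c(1, 1, 1, 1, 1, NA, 0, 0, 1, 0),
  C = c(0, 0, 0, 1, NA, 0, 0, 1, 1, 1),
  D = c(1, 1, 1, 1, 0, 1, 0, 1, NA, 1)
)
# Total is NA for any participant with at least one missing answer
mydf$Total <- rowSums(mydf[, c("A", "B", "C", "D")])

# One Welch t-test per column; t.test() drops NAs within each group itself
one_test <- function(q) {
  no  <- mydf[[q]][mydf$diagnosis == "no"]
  yes <- mydf[[q]][mydf$diagnosis == "yes"]
  data.frame(
    question = q,
    n_no     = sum(!is.na(no)),
    n_yes    = sum(!is.na(yes)),
    mean_no  = mean(no, na.rm = TRUE),
    mean_yes = mean(yes, na.rm = TRUE),
    sd_no    = sd(no, na.rm = TRUE),
    sd_yes   = sd(yes, na.rm = TRUE),
    p        = t.test(no, yes)$p.value
  )
}

results <- do.call(rbind, lapply(c("A", "B", "C", "D", "Total"), one_test))

# Benjamini-Hochberg adjustment across the five tests
results$p_adj <- p.adjust(results$p, method = "BH")
results
```

The n_no and n_yes columns count only non-missing scores, so they should match the n column of the summary table above.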

How to drop NA's out of the summarise(count = n()) function in R?

I have a dataset containing 4 organisation units (org_unit) with different number of participants and 2 Questions (Q1,Q2) on a 2-degree scale (1:2). I want to know how many people per unit answered the respective question with [1] and divide them by the total number of participants / unit.
Org_unit <- c(1,1,1,1,2,2,2,3,3,4)
Q1 <- c(1,2,1,2,1,2,1,2,1,2)
Q2 <- c(-9,-9,-9,-9,-9,-9,-9,-9,-9,-9)
The problem is, my Q2 only consists of [-9] which stands for non-response. I therefore assigned NA to [-9].
DF <- data.frame(Org_unit, Q1, Q2)
DF[DF == -9] <- NA
DF
Org_unit Q1 Q2
1 1 1 NA
2 1 2 NA
3 1 1 NA
4 1 2 NA
5 2 1 NA
6 2 2 NA
7 2 1 NA
8 3 2 NA
9 3 1 NA
10 4 2 NA
Next I calculated the proportion of people who answered Q1 with [1], which works fine.
prop_q1 <- DF %>%
group_by(Org_unit) %>%
summarise(count = n(),
prop = mean(Q1 == 1))
prop_q1
# A tibble: 4 x 3
Org_unit count prop
<dbl> <int> <dbl>
1 1 4 0.5
2 2 3 0.667
3 3 2 0.5
4 4 1 0
When I run the same code for Q2, however, I get the same number of members per unit (count = c(4,3,2,1)), although nobody answered the question. I don't want them to be registered as participants, since they technically didn't participate in the study.
prop_q2 <- DF %>%
group_by(Org_unit) %>%
summarise(count = n(),
prop = mean(Q2 == 1))
prop_q2
# A tibble: 4 x 3
Org_unit count prop
<dbl> <int> <dbl>
1 1 4 NA
2 2 3 NA
3 3 2 NA
4 4 1 NA
Is there a way to calculate the right number of members per unit when facing NAs (the [-9] values)?
Thanks!

Would
prop_q2 <- DF %>%
filter(!is.na(Q2)) %>%
group_by(Org_unit) %>%
summarise(count = n(),
prop = mean(Q2 == 1))
do the job?

Given that you want to do this across multiple columns, I think that using across() within the dplyr verbs will be better for you. I explain the solution below.
library(dplyr)
Org_unit <- c(1,1,1,1,2,2,2,3,3,4)
Q1 <- c(1,2,1,2,1,2,1,2,1,2)
Q2 <- c(1,-9,-9,-9,-9,-9,-9,-9,-9,-9) #Note one response
df <- tibble(Org_unit, Q1, Q2)
df %>%
  mutate(across(starts_with("Q"), ~na_if(., -9))) %>%
  group_by(Org_unit) %>%
  summarize(across(starts_with("Q"),
    list(
      N = ~sum(!is.na(.)),
      prop = ~sum(. == 1, na.rm = TRUE)/sum(!is.na(.)))
  ))
# A tibble: 4 x 5
Org_unit Q1_N Q1_prop Q2_N Q2_prop
* <dbl> <int> <dbl> <int> <dbl>
1 1 4 0.5 1 1
2 2 3 0.667 0 NaN
3 3 2 0.5 0 NaN
4 4 1 0 0 NaN
First, we take the data frame (which I created as a tibble) and replace every -9 with NA in all columns whose names start with a capital "Q". After this step, all question columns have NAs in place of -9s.
Second, we group by the organizational unit and then summarize using two functions. The first counts the responses that are not NA; the suffix _N is appended to the names of the resulting columns. The second calculates the proportion of non-missing responses that equal 1; its columns get the suffix _prop. A unit with no responses at all gets NaN, because the proportion is 0/0.
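The two summary functions can be checked on single vectors outside of dplyr. This is a small sketch; the helper name summarise_question is hypothetical and not part of any package.

```r
# Hypothetical helper: response count and share of 1s for one question column
summarise_question <- function(x) {
  n_resp <- sum(!is.na(x))              # rows with an actual response
  n_ones <- sum(x == 1, na.rm = TRUE)   # responses equal to 1
  c(N = n_resp, prop = n_ones / n_resp)
}

summarise_question(c(1, 2, 1, 2))     # N = 4, prop = 0.5
summarise_question(c(1, NA, NA, NA))  # N = 1, prop = 1
summarise_question(c(NA, NA, NA))     # N = 0, prop = NaN (0 / 0)
```

Counting with sum(!is.na(x)) rather than n() is what keeps non-respondents out of the participant count.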

How can I compute the reverse rank abundance of a species matrix?

I have a dataset in R that contains species abundance data ordered by station and replicate sample. So, column one contains the station number, column two contains the replicate number, column three contains the species name, and column four contains the species abundance.
I want to add a new fifth column that contains the reverse rank abundance of a species per station/replicate combination (i.e., If there are four species in a station/replicate, I want the species with the lowest abundance to be given a value of 1, and the species with the highest abundance to be given a value of 4).
Here a sample code of the type of dataset I am working with:
library(tidyverse)
dat <- as.data.frame(matrix(c(1,1,"A",2.34,
1,1,"B",4.32,
1,1,"C",2.46,
1,1,"D",6.32,
1,2,"A",3.54,
1,2,"B",7.67,
1,2,"D",3.45,
2,1,"D",4.67,
2,1,"E",6.54,
2,1,"G",5.67,
2,2,"B",2.31,
2,2,"G",1.12), ncol = 4, nrow = 12, byrow = TRUE
))
names(dat)[1] <- "station"
names(dat)[2] <- "replicate"
names(dat)[3] <- "taxa"
names(dat)[4] <- "abundance"
dat %>%
mutate(abundance = parse_number(abundance))
station replicate taxa abundance
1 1 A 2.34
1 1 B 4.32
1 1 C 2.46
1 1 D 6.32
1 2 A 3.54
1 2 B 7.67
1 2 D 3.45
2 1 D 4.67
2 1 E 6.54
2 1 G 5.67
2 2 B 2.31
2 2 G 1.12
And here is some code to reorder the dataset so that it goes from the species with the lowest abundance to the species with the highest abundance per station/replicate:
dat %>%
arrange(abundance) %>%
arrange(replicate) %>%
arrange(station)
For some reason, I am unsure how to continue from here. Any help would be greatly appreciated!

If you want to rank by station/replicate, you first group_by this combination, and then create a new column with the rank value.
library(tidyverse)
dat %>%
group_by(station, replicate) %>%
mutate(abundance = as.numeric(abundance),
rank = rank(abundance))
Output
station replicate taxa abundance rank
<chr> <chr> <chr> <dbl> <dbl>
1 1 1 A 2.34 1
2 1 1 B 4.32 3
3 1 1 C 2.46 2
4 1 1 D 6.32 4
5 1 2 A 3.54 2
6 1 2 B 7.67 3
7 1 2 D 3.45 1
8 2 1 D 4.67 1
9 2 1 E 6.54 3
10 2 1 G 5.67 2
11 2 2 B 2.31 2
12 2 2 G 1.12 1
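The same per-group ranking can be sketched in base R with ave(), which applies a function within each station/replicate combination. The data frame below rebuilds the example with numeric columns. The choice of ties.method = "min" is an assumption about how tied abundances should be handled; the example data contain no ties, so it does not affect the result here.

```r
# Example data with proper column types
dat <- data.frame(
  station   = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  replicate = c(1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2),
  taxa      = c("A", "B", "C", "D", "A", "B", "D", "D", "E", "G", "B", "G"),
  abundance = c(2.34, 4.32, 2.46, 6.32, 3.54, 7.67, 3.45,
                4.67, 6.54, 5.67, 2.31, 1.12)
)

# Rank within each station/replicate group: lowest abundance gets 1
dat$rank <- ave(dat$abundance, dat$station, dat$replicate,
                FUN = function(x) rank(x, ties.method = "min"))
dat
```

Using rank(-x) inside FUN instead would give the conventional ordering, with the most abundant species ranked 1.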

Determine percentage of rows with missing values in a dataframe in R

I have a data frame with three variables and some missing values in one of the variables that looks like this:
subject <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2)
part <- c(0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3,0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3)
sad <- c(1,7,7,4,NA,NA,2,2,NA,2,3,NA,NA,2,2,1,NA,5,NA,6,6,NA,NA,3,3,NA,NA,5,3,NA,7,2)
df1 <- data.frame(subject,part,sad)
I have created a new data frame with the mean values of 'sad' per subject and part using a loop, like this:
columns<-c("sad.m",
"part",
"subject")
df2<-matrix(data=NA,nrow=1,ncol=length(columns))
df2<-data.frame(df2)
names(df2)<-columns
tn<-unique(df1$subject)
row=1
for (s in tn){
for (i in 0:3){
TN<-df1[df1$subject==s&df1$part==i,]
df2[row,"sad.m"]<-mean(as.numeric(TN$sad), na.rm = TRUE)
df2[row,"part"]<-i
df2[row,"subject"]<-s
row=row+1
}
}
Now I want to include an additional variable 'missing' that indicates the percentage of rows per subject and part with missing values, so that I get df3:
subject <- c(1,1,1,1,2,2,2,2)
part<-c(0,1,2,3,0,1,2,3)
sad.m<-df2$sad.m
missing <- c(0,50,50,25,50,50,50,25)
df3 <- data.frame(subject,part,sad.m,missing)
I'd really appreciate any help on how to go about this!

It's best to avoid explicit loops in R for this kind of task where possible, since filling a data frame row by row gets messy and tends to be slow. The dplyr package is well suited to this sort of thing and well worth learning. It can save you a lot of time.
You can create a data frame with both variables by first grouping by subject and part, and then summarising the grouped data frame:
library(dplyr)
df2 = df1 %>%
  dplyr::group_by(subject, part) %>%
  dplyr::summarise(
    sad_mean = mean(na.omit(sad)),
    na_count = sum(is.na(sad)) / n() * 100  # percentage of rows with NA
  )
df2
# A tibble: 8 x 4
# Groups: subject [2]
subject part sad_mean na_count
<dbl> <dbl> <dbl> <dbl>
1 1 0 4.75 0
2 1 1 2 50
3 1 2 2.5 50
4 1 3 1.67 25
5 2 0 5.5 50
6 2 1 4.5 50
7 2 2 4 50
8 2 3 4 25

For each subject and part, you can calculate the mean of sad and the percentage of NA values using is.na and mean.
library(dplyr)
df1 %>%
group_by(subject, part) %>%
summarise(sad.m = mean(sad, na.rm = TRUE),
perc_missing = mean(is.na(sad)) * 100)
# subject part sad.m perc_missing
# <dbl> <dbl> <dbl> <dbl>
#1 1 0 4.75 0
#2 1 1 2 50
#3 1 2 2.5 50
#4 1 3 1.67 25
#5 2 0 5.5 50
#6 2 1 4.5 50
#7 2 2 4 50
#8 2 3 4 25
The same logic with data.table:
library(data.table)
setDT(df1)[, .(sad.m = mean(sad, na.rm = TRUE),
perc_missing = mean(is.na(sad)) * 100), .(subject, part)]
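Why mean(is.na(sad)) works: is.na() returns a logical vector, and the mean of a logical vector is the share of TRUE values. This is a small base R check on one group from df1 (subject 1, part 2); the names pct_a and pct_b are only illustrative.

```r
sad <- c(NA, 2, 3, NA)   # values of 'sad' for subject 1, part 2

is.na(sad)               # TRUE FALSE FALSE TRUE

# Mean of a logical vector = proportion of TRUEs
pct_a <- mean(is.na(sad)) * 100

# Equivalent: number of NAs divided by number of rows
pct_b <- sum(is.na(sad)) / length(sad) * 100

c(pct_a, pct_b)          # both are 50
```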

Try this dplyr approach to compute df3:
library(dplyr)
#Code
df3 <- df1 %>% group_by(subject,part) %>% summarise(N=100*length(which(is.na(sad)))/length(sad))
Output:
# A tibble: 8 x 3
# Groups: subject [2]
subject part N
<dbl> <dbl> <dbl>
1 1 0 0
2 1 1 50
3 1 2 50
4 1 3 25
5 2 0 50
6 2 1 50
7 2 2 50
8 2 3 25
And for full interaction with df2 you can use left_join():
#Left join
df3 <- df1 %>% group_by(subject,part) %>%
summarise(N=100*length(which(is.na(sad)))/length(sad)) %>%
left_join(df2)
Output:
# A tibble: 8 x 4
# Groups: subject [2]
subject part N sad.m
<dbl> <dbl> <dbl> <dbl>
1 1 0 0 4.75
2 1 1 50 2
3 1 2 50 2.5
4 1 3 25 1.67
5 2 0 50 5.5
6 2 1 50 4.5
7 2 2 50 4
8 2 3 25 4

Combine group_by, ifelse and filter

I would like to combine group_by, ifelse and filter in my code for the example data frame below. What I would like is the following:
1) Group by x.
2) Check whether result > 1. If TRUE, check whether the month for which result > 1 equals max(month) for that group. If TRUE, select all rows for that group.
All other rows should be discarded (that is, both when result <= 1 and when the month where result > 1 is not max(month)). So in my example data frame, all rows for B should be kept and all rows for A should be discarded.
x month result
1 A 1 0.5
2 A 2 0.6
3 A 3 1.2
4 A 4 1.1
5 A 5 0.9
6 B 1 0.3
7 B 2 0.4
8 B 3 0.5
9 B 4 0.9
10 B 5 1.2
dat <- data.frame(x = c("A","A","A","A","A","B","B","B","B","B"),
month = c(1,2,3,4,5,1,2,3,4,5),
result = c(.5,.6,1.2,1.1,.9,.3,.4,.5,.9,1.2))

Using data.table
library(data.table)
setDT(dat)[, .SD[result[which.max(month)] > 1], x]
# x month result
#1: B 1 0.3
#2: B 2 0.4
#3: B 3 0.5
#4: B 4 0.9
#5: B 5 1.2
Or with dplyr
library(dplyr)
dat %>%
group_by(x) %>%
filter(result[which.max(month)] > 1)
# A tibble: 5 x 3
# Groups: x [1]
# x month result
# <fct> <dbl> <dbl>
#1 B 1 0.3
#2 B 2 0.4
#3 B 3 0.5
#4 B 4 0.9
#5 B 5 1.2

If you want to stay in the tidyverse and not venture into base subsetting, we can easily get there as well, by using any to check whether any row in the group meets your criteria:
dat %>%
group_by(x) %>%
filter(any(result > 1 & month == max(month)))
# A tibble: 5 x 3
# Groups: x [1]
x month result
<fct> <dbl> <dbl>
1 B 1 0.3
2 B 2 0.4
3 B 3 0.5
4 B 4 0.9
5 B 5 1.2
Alternatively, sometimes I'll create a "keep" variable to check if I've got the right ones, initially, or to make the code more easily readable by someone looking at my code years later:
dat %>%
group_by(x) %>%
mutate(keep = (result > 1 & month == max(month))) %>%
filter(any(keep))

Here is a solution with base R (without group_by or filter):
res <- Reduce(rbind,lapply(split(dat,dat$x), function(v) {
if (v$result[which.max(v$month)]>1) v else NULL}))
such that
> res
x month result
6 B 1 0.3
7 B 2 0.4
8 B 3 0.5
9 B 4 0.9
10 B 5 1.2
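A variation on the base R idea, sketched under the assumption that each group has a single latest month: compute one logical flag per group, then subset the data frame once instead of binding the pieces back together. The names last_ok and res2 are illustrative.

```r
dat <- data.frame(
  x      = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B"),
  month  = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
  result = c(.5, .6, 1.2, 1.1, .9, .3, .4, .5, .9, 1.2)
)

# One flag per group: is 'result' above 1 in the group's latest month?
last_ok <- sapply(split(dat, dat$x), function(g) {
  g$result[which.max(g$month)] > 1
})

# Keep all rows of the groups whose flag is TRUE
res2 <- dat[dat$x %in% names(last_ok)[last_ok], ]
res2
```

Because the subset is taken from dat directly, the original row names (6 to 10 for group B) are preserved.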
