How can I compute the reverse rank abundance of a species matrix? - r

I have a dataset in R that contains species abundance data ordered by station and replicate sample. So, column one contains the station number, column two contains the replicate number, column three contains the species name, and column four contains the species abundance.
I want to add a new fifth column that contains the reverse rank abundance of each species per station/replicate combination (i.e., if there are four species in a station/replicate, I want the species with the lowest abundance to be given a value of 1, and the species with the highest abundance to be given a value of 4).
Here is sample code for the type of dataset I am working with:
library(tidyverse)
dat <- as.data.frame(matrix(c(1,1,"A",2.34,
1,1,"B",4.32,
1,1,"C",2.46,
1,1,"D",6.32,
1,2,"A",3.54,
1,2,"B",7.67,
1,2,"D",3.45,
2,1,"D",4.67,
2,1,"E",6.54,
2,1,"G",5.67,
2,2,"B",2.31,
2,2,"G",1.12), ncol = 4, nrow = 12, byrow = TRUE
))
names(dat)[1] <- "station"
names(dat)[2] <- "replicate"
names(dat)[3] <- "taxa"
names(dat)[4] <- "abundance"
# matrix() stores every value as character, so convert abundance back to numeric
dat <- dat %>%
mutate(abundance = parse_number(abundance))
station replicate taxa abundance
1       1         A    2.34
1       1         B    4.32
1       1         C    2.46
1       1         D    6.32
1       2         A    3.54
1       2         B    7.67
1       2         D    3.45
2       1         D    4.67
2       1         E    6.54
2       1         G    5.67
2       2         B    2.31
2       2         G    1.12
And here is some code to reorder the dataset so that it goes from the species with the lowest abundance to the species with the highest abundance per station/replicate:
dat %>%
arrange(abundance) %>%
arrange(replicate) %>%
arrange(station)
For some reason, I am unsure how to continue from here. Any help would be greatly appreciated!

If you want to rank by station/replicate, you first group_by this combination, and then create a new column with the rank value.
library(tidyverse)
dat %>%
group_by(station, replicate) %>%
mutate(abundance = as.numeric(abundance),
rank = rank(abundance))
Output
station replicate taxa abundance rank
<chr> <chr> <chr> <dbl> <dbl>
1 1 1 A 2.34 1
2 1 1 B 4.32 3
3 1 1 C 2.46 2
4 1 1 D 6.32 4
5 1 2 A 3.54 2
6 1 2 B 7.67 3
7 1 2 D 3.45 1
8 2 1 D 4.67 1
9 2 1 E 6.54 3
10 2 1 G 5.67 2
11 2 2 B 2.31 2
12 2 2 G 1.12 1
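If tied abundances or the opposite ordering matter, the same grouped approach can be adapted. The following is a minimal sketch, assuming the data are built directly as numeric columns; the column names rank_low_first and rank_high_first are illustrative and not part of the original question.

```r
library(dplyr)

# Same values as the question's sample data, built with numeric columns directly
dat <- data.frame(
  station   = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  replicate = c(1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2),
  taxa      = c("A", "B", "C", "D", "A", "B", "D", "D", "E", "G", "B", "G"),
  abundance = c(2.34, 4.32, 2.46, 6.32, 3.54, 7.67, 3.45, 4.67, 6.54, 5.67, 2.31, 1.12)
)

ranked <- dat %>%
  group_by(station, replicate) %>%
  mutate(
    # lowest abundance gets 1; tied values share the smallest rank
    rank_low_first  = rank(abundance, ties.method = "min"),
    # highest abundance gets 1 (ranking the negated values flips the order)
    rank_high_first = rank(-abundance, ties.method = "min")
  ) %>%
  ungroup()

print(ranked)
```

With ties.method = "min" the ranks are integers; the default ("average") would give fractional ranks such as 1.5 for tied values.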

Related

Multiple T-Tests in one go in R

I have a data frame like this:
diagnosis A B C D
1 yes 1 1 0 1
2 no 0 1 0 1
3 yes 0 1 0 1
4 yes 1 1 1 1
5 yes NA 1 NA 0
6 no 1 NA 0 1
7 yes 1 0 0 0
8 no 0 0 1 1
9 no 0 1 1 NA
10 no 1 0 1 1
A, B, C, and D refer to the questions in my test and the number "1" means the participant got it right and "0" means the participant's answer is wrong.
What I want is to perform multiple two sample t-tests for each question and the total score for the test.
And these are the steps I took so far:
#calculate sum score per participant
mydf <- cbind(mydf, Total = rowSums(mydf[,2:5]))
#Reshape the tibble from wide to long format
mydf <- mydf %>%
pivot_longer(!diagnosis, names_to = "Questions", values_to = "Score")
#summary of my data
Sumdf <- mydf %>% group_by(Questions, diagnosis) %>% get_summary_stats(Score, type = "mean_sd")
Sumdf
A tibble: 10 x 6
diagnosis Questions variable n mean sd
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 no A Score 5 0.4 0.548
2 yes A Score 4 0.75 0.5
3 no B Score 4 0.5 0.577
4 yes B Score 5 0.8 0.447
5 no C Score 5 0.6 0.548
6 yes C Score 4 0.25 0.5
7 no D Score 4 1 0
8 yes D Score 5 0.6 0.548
9 no Total Score 3 2.33 0.577
10 yes Total Score 4 2.5 1.29
After this point, how can I run a t-test comparing those means across diagnoses, for each question and for the total score?
I found something like this on the internet:
#Run T-test
ttest <- mydf %>%
group_by(Questions) %>%
t_test(Score ~ diagnosis) %>%
adjust_pvalue(method = "BH") %>%
add_significance()
And this is what I got:
But as you can see, the n values here are not correct (because I had NAs), and I don't know why the adjusted p-values are the same across the questions. I read that when running multiple t-tests it is better to use adjusted p-values, but I am not sure about it. Also, I want to include the means and SDs in my table too (I plan to knit this script to a PDF with papaja).
So, is there another way to run multiple t-tests, or does what I found look trustworthy, and should I rely on the adjusted p-values as the code suggests?
Thank you so much!
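On the point about identical adjusted p-values: the Benjamini-Hochberg ("BH") adjustment multiplies each p-value by the number of tests divided by its rank and then enforces monotonicity, so ties are common when only a few tests are adjusted. A minimal sketch with base R's p.adjust() follows; the raw p-values are made up for illustration and are not taken from the data above.

```r
# Hypothetical raw p-values from five tests (illustrative only)
p_raw <- c(0.01, 0.02, 0.03, 0.04, 0.05)

# BH adjustment: p * n / rank, followed by a running minimum
# taken from the largest p-value downwards
p_bh <- p.adjust(p_raw, method = "BH")

# All five adjusted values come out at (about) 0.05,
# even though the raw p-values differ
print(round(p_bh, 3))
```

Identical adjusted values are therefore not by themselves a sign that the code is wrong.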

How to sample across a dataset with two factors in it?

I have a data frame with two species, A and B, and two variables, a and b, with 100 rows in total.
I want to create a sampler that, for each set, randomly picks 6 rows (reps) from df. However, the samples for A must come only from rows associated with sp A in df, and similarly for B. I want to do this 500 times for each of species A and B.
I attempted a for loop, but when I run it the result is a single row with 6 columns. I would appreciate any guidance.
a <- rnorm(100, 2,1)
b <- rnorm(100, 2,1)
sp <- rep(c("A","B"), each = 50)
df <- data.frame(a,b,sp)
df.sample <- for(i in 1:1000){
sampling <- sample(df[i,],6,replace = TRUE)
}
#Output in a single row
a a.1 sp b sp.1 a.2
1000 1.68951 1.68951 B 1.395995 B 1.68951
#Expected dataframe
df.sample
set rep a b sp
1 1 1 9 A
1 2 3 2 A
1 3 0 2 A
1 4 1 2 A
1 5 1 6 A
1 6 4 2 A
2 1 1 2 B
2 2 5 2 B
2 3 1 2 B
2 4 1 6 B
2 5 1 8 B
2 6 9 2 B
....
Here's how I would do it (using tidyverse):
data:
a <- rnorm(100, 2,1)
b <- rnorm(100, 2,1)
sp <- rep(c("A","B"), each = 50)
df <- data.frame(a,b,sp)
# create an empty table with desired columns
library(tidyverse)
output <- tibble(a = numeric(),
b = numeric(),
sp = character(),
set = numeric())
# sampling in a loop
set.seed(42)
for(i in 1:500){
samp1 <- df %>% filter(sp == 'A') %>% sample_n(6, replace = TRUE) %>% mutate(set = i)
samp2 <- df %>% filter(sp == 'B') %>% sample_n(6, replace = TRUE) %>% mutate(set = i)
output %>% add_row(bind_rows(samp1, samp2)) -> output
}
Result
> head(output, 20)
# A tibble: 20 × 4
a b sp set
<dbl> <dbl> <chr> <dbl>
1 2.59 3.31 A 1
2 1.84 1.66 A 1
3 2.35 1.17 A 1
4 2.33 1.95 A 1
5 0.418 1.11 A 1
6 1.19 2.54 A 1
7 2.35 0.899 B 1
8 1.19 1.63 B 1
9 0.901 0.986 B 1
10 3.12 1.75 B 1
11 2.28 2.61 B 1
12 1.37 3.47 B 1
13 2.33 1.95 A 2
14 1.84 1.66 A 2
15 3.76 1.26 A 2
16 2.96 3.10 A 2
17 1.03 1.81 A 2
18 1.42 2.00 A 2
19 0.901 0.986 B 2
20 2.37 1.39 B 2
You could first split df by species. Random rows within each species can be drawn with x[sample(nrow(x), 6), ]. Passing that expression to replicate() repeats the sampling as many times as needed (3 times in the example below). dplyr::bind_rows() then combines the samples and adds a new column, set, holding the sampling index.
lapply(split(df, df$sp), function(x) {
dplyr::bind_rows(
replicate(3, x[sample(nrow(x), 6), ], FALSE),
.id = "set"
)
})
Output
$A
set a b sp
1 1 1.52480034 3.41257975 A
2 1 1.82542370 2.08511584 A
3 1 1.80019901 1.39279162 A
4 1 2.20765154 2.11879412 A
5 1 1.61295185 2.04035172 A
6 1 1.92936567 2.90362816 A
7 2 0.88903679 2.46948106 A
8 2 3.19223788 2.81329767 A
9 2 1.28629416 2.69275525 A
10 2 2.61044815 0.82495427 A
11 2 2.30928735 1.67421328 A
12 2 -0.09789704 2.62434719 A
13 3 2.10386603 1.78157862 A
14 3 2.17542841 0.84016203 A
15 3 3.22202227 3.49863423 A
16 3 1.07929909 -0.02032945 A
17 3 2.95271838 2.34460193 A
18 3 1.90414536 1.54089645 A
$B
set a b sp
1 1 3.5130317 -0.4704879 B
2 1 3.0053072 1.6021795 B
3 1 4.1167657 1.1123342 B
4 1 1.5460589 3.2915979 B
5 1 0.8742753 0.9132530 B
6 1 2.0882660 1.5588471 B
7 2 1.2444645 1.8199525 B
8 2 2.7960117 2.6657735 B
9 2 2.5970774 0.9984187 B
10 2 1.1977317 3.7360884 B
11 2 2.2830643 1.0452440 B
12 2 3.1047150 1.5609482 B
13 3 2.9309124 1.5679255 B
14 3 0.8631965 1.3501631 B
15 3 1.5460589 3.2915979 B
16 3 2.7960117 2.6657735 B
17 3 3.1047150 1.5609482 B
18 3 2.8735390 0.6329279 B
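To get a single data frame with both a set and a rep column, as in the expected output of the question, the split-and-sample idea can be extended in base R. This is a minimal sketch under stated assumptions: n_sets, n_reps and draw_sets are hypothetical names introduced here, rows are drawn without replacement within a set, and set numbers restart at 1 for each species.

```r
set.seed(1)
df <- data.frame(a = rnorm(100, 2, 1),
                 b = rnorm(100, 2, 1),
                 sp = rep(c("A", "B"), each = 50))

n_sets <- 500  # number of sets per species (assumed from the question)
n_reps <- 6    # rows drawn per set

# Draw n_sets samples of n_reps rows from one species' rows
draw_sets <- function(x) {
  sets <- lapply(seq_len(n_sets), function(i) {
    rows <- x[sample(nrow(x), n_reps), ]
    cbind(set = i, rep = seq_len(n_reps), rows)
  })
  do.call(rbind, sets)
}

# Apply to each species separately, then stack the results
df.sample <- do.call(rbind, lapply(split(df, df$sp), draw_sets))
rownames(df.sample) <- NULL

head(df.sample)
```

Use sample(nrow(x), n_reps, replace = TRUE) instead if the same row may appear more than once within a set.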
If I understood correctly what you want, it can be done with the following code:
# Create the initial data frame
a <- rnorm(100, 2,1)
b <- rnorm(100, 2,1)
sp <- rep(c("A","B"), each = 50)
df <- data.frame(a,b,sp)
# Rows with sp=A
row.A <- which(df$sp=="A")
row.B <- which(df$sp=="B")
# Sampling data.frame
sampling <- data.frame(matrix(ncol = 5, nrow = 0))
# "rep" column for each iteration
rep1 <- rep(1:6,2)
# Build the data.frame
for(i in 1:500){
# Sampling row.A
s.A <- sample(row.A,6,replace = T)
# Sampling row.B
s.B <- sample(row.B,6,replace = T)
# Data frame with the subset of df and "set" and "rep" values
sampling <- rbind(sampling, set=cbind(rep(i,12),rep=rep1,df[c(s.A,s.B),]))
}
# Delete row.names of sampling and redefine sampling's column names
row.names(sampling) <- NULL
colnames(sampling) <- c("set", "rep", "a", "b", "sp")
And the output looks like this:
set rep a b sp
1 1 3.713663 2.717456 A
1 2 2.456070 2.803443 A
1 3 2.166655 1.395556 A
1 4 1.453738 5.662969 A
1 5 2.692518 2.971156 A
1 6 2.699634 3.016791 A

How to perform complex algebraic operation by group in R?

I have data frame mydata that looks like this:
city district mean1 mean2 var
alpha A 1 2 0.5
beta A 3 1 0.2
gamma B 1.5 1 1
zeta B 2 0 3
...
omega C 1 1 2
I would like to perform some more complex arithmetic by group. To be more specific, I would like to calculate the following:
sqrt(n(mydata))*((mean(mydata$mean1)-mean(mydata$mean2))/sqrt(mean(mydata$var)))
I tried something like this with dplyr:
resutl<-mydata %>%
group_by(district) %>%
sqrt(n(mydata))*((mean(mydata$mean1)-mean(mydata$mean2))/sqrt(mean(mydata$var))
However, the above did not work because dplyr does not recognize it as a function. Of course, one solution would be to apply summarise function to calculate all means and observation count by group, put them in new data frame and then perform the calculation above by row, but is there a more efficient way of doing this?
You could use dplyr's mutate function:
library(dplyr)
df %>%
group_by(district) %>%
mutate(calculation = sqrt(n()) * (mean(mean1) - mean(mean2))/sqrt(mean(var)))
returns
# A tibble: 5 x 6
# Groups: district [3]
city district mean1 mean2 var calculation
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 alpha A 1 2 0.5 1.20
2 beta A 3 1 0.2 1.20
3 gamma B 1.5 1 1 1.25
4 zeta B 2 0 3 1.25
5 omega C 1 1 2 0
Attention: I'm not sure whether you need the number of rows of the whole dataset or just of each group. In the first case, replace n() with nrow(df).
Data
df <- readr::read_table2("city district mean1 mean2 var
alpha A 1 2 0.5
beta A 3 1 0.2
gamma B 1.5 1 1
zeta B 2 0 3
omega C 1 1 2")
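If one row per district is preferred over repeating the value on every row, summarise() can apply the formula from the question directly. A minimal sketch, assuming only the five rows shown in the question (the elided rows are left out) and using the hypothetical column name statistic:

```r
library(dplyr)

# Only the rows visible in the question; the "..." rows are omitted
mydata <- data.frame(
  city     = c("alpha", "beta", "gamma", "zeta", "omega"),
  district = c("A", "A", "B", "B", "C"),
  mean1    = c(1, 3, 1.5, 2, 1),
  mean2    = c(2, 1, 1, 0, 1),
  var      = c(0.5, 0.2, 1, 3, 2)
)

result <- mydata %>%
  group_by(district) %>%
  summarise(
    # n() is the number of rows in the current group
    statistic = sqrt(n()) * (mean(mean1) - mean(mean2)) / sqrt(mean(var)),
    .groups = "drop"
  )

print(result)
```

For district B this works out to sqrt(2) * (1.75 - 0.5) / sqrt(2) = 1.25.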

Determine percentage of rows with missing values in a dataframe in R

I have a data frame with three variables and some missing values in one of the variables that looks like this:
subject <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2)
part <- c(0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3,0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3)
sad <- c(1,7,7,4,NA,NA,2,2,NA,2,3,NA,NA,2,2,1,NA,5,NA,6,6,NA,NA,3,3,NA,NA,5,3,NA,7,2)
df1 <- data.frame(subject,part,sad)
I have created a new data frame with the mean values of 'sad' per subject and part using a loop, like this:
columns<-c("sad.m",
"part",
"subject")
df2<-matrix(data=NA,nrow=1,ncol=length(columns))
df2<-data.frame(df2)
names(df2)<-columns
tn<-unique(df1$subject)
row=1
for (s in tn){
for (i in 0:3){
TN<-df1[df1$subject==s&df1$part==i,]
df2[row,"sad.m"]<-mean(as.numeric(TN$sad), na.rm = TRUE)
df2[row,"part"]<-i
df2[row,"subject"]<-s
row=row+1
}
}
Now I want to include an additional variable, 'missing', that indicates the percentage of rows with missing values per subject and part, so that I get df3:
subject <- c(1,1,1,1,2,2,2,2)
part<-c(0,1,2,3,0,1,2,3)
sad.m<-df2$sad.m
missing <- c(0,50,50,25,50,50,50,25)
df3 <- data.frame(subject,part,sad.m,missing)
I'd really appreciate any help on how to go about this!
It's best to avoid loops in R where possible, since they can get messy and tend to be quite slow. For this sort of task, the dplyr package is a good fit and well worth learning. It can save you a lot of time.
You can create a data frame with both variables by first grouping by subject and part, and then performing a summary of the grouped data frame:
library(dplyr)
df2 = df1 %>%
dplyr::group_by(subject, part) %>%
dplyr::summarise(
sad_mean = mean(sad, na.rm = TRUE),
# percentage of rows in the group where sad is missing
na_count = sum(is.na(sad)) / n() * 100
)
df2
# A tibble: 8 x 4
# Groups: subject [2]
subject part sad_mean na_count
<dbl> <dbl> <dbl> <dbl>
1 1 0 4.75 0
2 1 1 2 50
3 1 2 2.5 50
4 1 3 1.67 25
5 2 0 5.5 50
6 2 1 4.5 50
7 2 2 4 50
8 2 3 4 25
For each subject and part, you can calculate the mean of sad and the percentage of NA values; taking the mean of is.na(sad) gives the proportion of missing rows.
library(dplyr)
df1 %>%
group_by(subject, part) %>%
summarise(sad.m = mean(sad, na.rm = TRUE),
perc_missing = mean(is.na(sad)) * 100)
# subject part sad.m perc_missing
# <dbl> <dbl> <dbl> <dbl>
#1 1 0 4.75 0
#2 1 1 2 50
#3 1 2 2.5 50
#4 1 3 1.67 25
#5 2 0 5.5 50
#6 2 1 4.5 50
#7 2 2 4 50
#8 2 3 4 25
Same logic with data.table :
library(data.table)
setDT(df1)[, .(sad.m = mean(sad, na.rm = TRUE),
perc_missing = mean(is.na(sad)) * 100), .(subject, part)]
Try this dplyr approach to compute df3:
library(dplyr)
#Code
df3 <- df1 %>% group_by(subject,part) %>% summarise(N=100*length(which(is.na(sad)))/length(sad))
Output:
# A tibble: 8 x 3
# Groups: subject [2]
subject part N
<dbl> <dbl> <dbl>
1 1 0 0
2 1 1 50
3 1 2 50
4 1 3 25
5 2 0 50
6 2 1 50
7 2 2 50
8 2 3 25
To combine this with df2, you can use left_join():
#Left join
df3 <- df1 %>% group_by(subject,part) %>%
summarise(N=100*length(which(is.na(sad)))/length(sad)) %>%
left_join(df2)
Output:
# A tibble: 8 x 4
# Groups: subject [2]
subject part N sad.m
<dbl> <dbl> <dbl> <dbl>
1 1 0 0 4.75
2 1 1 50 2
3 1 2 50 2.5
4 1 3 25 1.67
5 2 0 50 5.5
6 2 1 50 4.5
7 2 2 50 4
8 2 3 25 4
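For a quick cross-check without extra packages, the same percentage can be computed in base R with tapply(), which returns a subject-by-part table. This is a minimal sketch; the name pct_missing is introduced here for illustration.

```r
# Same data as in the question
subject <- rep(1:2, each = 16)
part <- rep(rep(0:3, each = 4), times = 2)
sad <- c(1, 7, 7, 4, NA, NA, 2, 2, NA, 2, 3, NA, NA, 2, 2, 1,
         NA, 5, NA, 6, 6, NA, NA, 3, 3, NA, NA, 5, 3, NA, 7, 2)
df1 <- data.frame(subject, part, sad)

# mean(is.na(x)) is the proportion of missing values in each cell
pct_missing <- tapply(
  df1$sad,
  list(subject = df1$subject, part = df1$part),
  function(x) mean(is.na(x)) * 100
)

print(pct_missing)
```

The rows of the result correspond to subjects and the columns to parts, so the values can be compared directly with the missing vector in the question.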

Chisq Test R : Multiple group in data frame

I'm new to R and trying to run some statistical tests.
My data looks like this:
Name Freqeunce Target Total
Steve 1 A 11
Marcel 1 A 11
Marie 1 A 11
John 2 A 11
Max 2 A 11
Alice 4 A 11
Mariane 1 B 1
Rose 1 C 3
Carla 1 C 3
Happy 1 C 3
I want to run a chi-squared test of homogeneity for each target type (A, B and C).
I want to know whether it is possible in R to run a loop that writes the p-value for each name into a column, or whether I have to extract the information first and then run the chi-squared test.
The objective is to identify which names are less represented in their group according to the frequencies. There are more than 2000 groups, which is why I want a loop.
Thank you for your answer.
Baptiste
I think this will answer your question. I don't know if this is the type of chi^2 test you want but you can always change the function. I use group_by and mutate from the dplyr package and write a function to perform the chi^2 test and extract the pvalue.
library(dplyr)
df <- read.table("test2.txt", header = T)
c2_all <- function(x,y){
mat <- matrix(c(x,y),nrow = 2)
c2 <- chisq.test(mat)
return(c2$p.value)
}
result <- df %>% group_by(Target) %>% mutate(pvalue = c2_all(Name,Freqeunce))
result
# A tibble: 11 x 5
# Groups: Target [3]
Name Freqeunce Target Total pvalue
<fct> <int> <fct> <int> <dbl>
1 Steve 1 A 11 0.285
2 Marcel 1 A 11 0.285
3 Marie 1 A 11 0.285
4 John 2 A 11 0.285
5 Max 2 A 11 0.285
6 Alice 4 A 11 0.285
7 Sarah 2 B 3 1.00
8 Mariane 1 B 3 1.00
9 Rose 1 C 5 0.223
10 Carla 3 C 5 0.223
11 Happy 1 C 5 0.223
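If the goal is instead a goodness-of-fit test of whether the frequencies within each target are evenly spread across names, a per-group chisq.test() on the frequency vector is one option. This is only a sketch under that assumption: gof_p is a hypothetical helper, the data are retyped from the question, and groups with a single name return NA because the test needs at least two counts. With counts this small the chi-squared approximation is unreliable, so the p-values are illustrative only.

```r
# Data retyped from the question (column name spelled as in the original)
df <- data.frame(
  Name = c("Steve", "Marcel", "Marie", "John", "Max", "Alice",
           "Mariane", "Rose", "Carla", "Happy"),
  Freqeunce = c(1, 1, 1, 2, 2, 4, 1, 1, 1, 1),
  Target = c(rep("A", 6), "B", rep("C", 3))
)

# Goodness-of-fit p-value against equal expected frequencies;
# NA when the group has fewer than two names
gof_p <- function(freq) {
  if (length(freq) < 2) return(NA_real_)
  # chisq.test() warns about small expected counts; suppressed in this sketch
  suppressWarnings(chisq.test(freq)$p.value)
}

# One p-value per target, then copied back onto every row of that target
p_by_target <- tapply(df$Freqeunce, df$Target, gof_p)
df$pvalue <- as.numeric(p_by_target[df$Target])

print(df)
```

Because tapply() works on all groups at once, no explicit loop is needed even with a few thousand targets.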
