This is my first post here.
I have four dataframes for which I would like to run nonparametric tests row by row.
E.g. I would like to compare the values in each row of dataframe A with the values in the corresponding row of dataframe B.
I would need a nonparametric test, e.g. the Wilcoxon test.
I thought of making a new column with the median, but I am certain that there is something better.
Could you give me an idea how to do this?
Thank you in advance!
Edit:
Here are my imaginary dataframes.
I want to compare the dataframes row-wise, e.g. do a nonparametric test for John in dataframes A and B, then for Dora, etc.
A <- data.frame("A" = c("John","Dora","Robert","Jim"),
                "A1" = c(8,1,10,5),
                "A2" = c(9,1,1,4))
B <- data.frame("B" = c("John","Dora","Robert","Jim"),
                "B1" = c(1,1,1,5),
                "B2" = c(3,2,1,5),
                "B3" = c(4,3,1,5),
                "B4" = c(6,8,8,1))
I think you are looking for the function wilcox.test (in the stats package).
Solution 1: Using a for loop
One way to compare each row of A with the corresponding row of B (and extract the p-value) is to use a for loop such as this:
pval <- NULL
for (i in 1:nrow(A)) {
  vec_a <- as.numeric(A[i, 2:ncol(A)])              # values for row i of A
  vec_b <- as.numeric(B[B$B == A$A[i], 2:ncol(B)])  # row of B with the matching name
  p <- wilcox.test(vec_a, vec_b)
  pval <- c(pval, p$p.value)
  print(p)
}
In the end, you will get a vector pval containing the p-value for each row:
pval
[1] 0.1333333 0.2188194 0.5838824 1.0000000
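The same loop can also be written more compactly with sapply (a sketch under the same assumption that rows are matched by name):
pval <- sapply(seq_len(nrow(A)), function(i) {
  vec_a <- as.numeric(A[i, -1])              # row i of A, without the name column
  vec_b <- as.numeric(B[B$B == A$A[i], -1])  # matching row of B
  wilcox.test(vec_a, vec_b)$p.value
})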
Solution 2: Using tidyverse
A more elegant solution is to use the tidyverse packages (in particular dplyr and tidyr) to assemble your dataframes into a single one and compare the two groups for each name by passing a formula to wilcox.test.
First, we can merge your dataframes by name using the left_join function from dplyr:
library(dplyr)
DF <- left_join(A,B, by = c("A"="B"))
A A1 A2 B1 B2 B3 B4
1 John 8 9 1 3 4 6
2 Dora 1 1 1 2 3 8
3 Robert 10 1 1 1 1 8
4 Jim 5 4 5 5 5 1
Then, using the dplyr and tidyr packages, you can reshape your dataframe into a longer format:
library(dplyr)
library(tidyr)
DF %>% pivot_longer(., -A, names_to = "var", values_to = "values")
# A tibble: 24 x 3
A var values
<fct> <chr> <dbl>
1 John A1 8
2 John A2 9
3 John B1 1
4 John B2 3
5 John B3 4
6 John B4 6
7 Dora A1 1
8 Dora A2 1
9 Dora B1 1
10 Dora B2 2
# … with 14 more rows
We will create a new column group that indicates A or B depending on the values in the column var:
DF %>% pivot_longer(., -A, names_to = "var", values_to = "values") %>%
mutate(group = gsub("\\d","",var))
# A tibble: 24 x 4
A var values group
<fct> <chr> <dbl> <chr>
1 John A1 8 A
2 John A2 9 A
3 John B1 1 B
4 John B2 3 B
5 John B3 4 B
6 John B4 6 B
7 Dora A1 1 A
8 Dora A2 1 A
9 Dora B1 1 B
10 Dora B2 2 B
# … with 14 more rows
Finally, we can group by A and summarise the dataframe to get the p-value of wilcox.test comparing the values of the two groups for each name:
DF %>% pivot_longer(., -A, names_to = "var", values_to = "values") %>%
mutate(group = gsub("\\d","",var)) %>%
group_by(A) %>%
summarise(Pval = wilcox.test(values~group)$p.value)
# A tibble: 4 x 2
A Pval
<fct> <dbl>
1 Dora 0.219
2 Jim 1
3 John 0.133
4 Robert 0.584
It looks longer (especially because I explained each step), but in the end you can see that it needs fewer lines than the first solution.
Does this answer your question?
I have missing categorical variables in a list. I would like to add all the combinations of these classifications to the data frame using complete. I can do this for a single variable using mutate.
Simplified example:
library(tidyverse)
df <- tibble(a1 = 1:6,
             b1 = rep(c(1, 2), 3),
             c1 = rep(c(1:3), 2))
missing_cols <- list(d1 = c(7:8),
                     e1 = c(12:14))
# Use the first classification of d1 for mutate and complete with all classifications
df %>%
  mutate(!!names(missing_cols)[1] := missing_cols[[1]][1]) %>%
  complete(nesting(a1, b1, c1), d1 = missing_cols[[1]])
Desired output
df %>%
  mutate(!!names(missing_cols)[1] := missing_cols[[1]][1]) %>%
  mutate(!!names(missing_cols)[2] := missing_cols[[2]][1]) %>%
  complete(nesting(a1, b1, c1), d1 = missing_cols[[1]], e1 = missing_cols[[2]])
This will get the correct output for d1. How can I do this for all variables in my list?
We can use crossing with cross_df (note that cross_df comes from purrr, so purrr must be loaded too):
library(tidyr)
library(purrr)
crossing(df, cross_df(missing_cols))
# a1 b1 c1 d1 e1
# <int> <dbl> <int> <int> <int>
# 1 1 1 1 7 12
# 2 1 1 1 7 13
# 3 1 1 1 7 14
# 4 1 1 1 8 12
# 5 1 1 1 8 13
# 6 1 1 1 8 14
# 7 2 2 2 7 12
# 8 2 2 2 7 13
# 9 2 2 2 7 14
#10 2 2 2 8 12
# … with 26 more rows
cross_df creates all possible combinations of the elements of missing_cols, while crossing takes that output and creates all possible combinations with df.
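You can run cross_df on its own to see the intermediate result (a quick check; the row order may differ from crossing's sorted output):
library(purrr)
cross_df(missing_cols)
# a 6-row tibble with columns d1 and e1: every d1 value paired with every e1 value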
Using expand.grid
library(tidyr)
crossing(df, expand.grid(missing_cols))
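expand.grid() on the named list produces the same set of combinations, with the first column varying fastest:
expand.grid(missing_cols)
#   d1 e1
# 1  7 12
# 2  8 12
# 3  7 13
# 4  8 13
# 5  7 14
# 6  8 14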
Here is a data frame:
ID<-c(rep("A",3),rep("B",2), rep("C",3),rep("D",5))
cell<-c("a1","a2","a3","a1","a2","a1","a2", "a3","a1","a2","a1","a2","a3")
value<-c(2,5,3,4,5,6,9,8,7,2,5,2,4)
df<-as.data.frame(cbind(ID, cell, value))
I want to calculate the sum of all values for each ID up to cell "a2" (inclusive). The sequence of cells and IDs must be taken into account. Rows with no cell "a2" remaining after the sum is calculated should not be taken into account.
As a result I would like to get this table:
Could you please help me code this condition?
Thanks in advance.
Best regards, Inna
Assuming the data is already correctly ordered by cell:
library( tidyverse )
df %>%
group_by( ID ) %>%
mutate( value = cumsum( value ) ) %>%
filter( cell == "a2" )
# # A tibble: 5 x 3
# # Groups: ID [4]
# ID cell value
# <chr> <chr> <dbl>
# 1 A a2 7
# 2 B a2 9
# 3 C a2 15
# 4 D a2 9
# 5 D a2 16
Treating each occurrence of "a2" as a different group (so the running sum resets after each "a2", unlike in the answer above), we can do:
library(dplyr)
df %>%
#Create a group column with every value of cell == 'a2' as different group
group_by(ID, grp = cumsum(lag(cell == 'a2', default = TRUE))) %>%
#Remove those groups that do not have 'a2' in them
filter(any(cell == 'a2')) %>%
#Sum till 'a2' value
summarise(value = sum(value[seq_len(match('a2', cell))]),
cell = last(cell)) %>%
select(-grp)
# ID value cell
# <chr> <dbl> <chr>
#1 A 7 a2
#2 B 9 a2
#3 C 15 a2
#4 D 9 a2
#5 D 7 a2
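To see how the grouping works, you can inspect the helper column on its own (a quick check, not part of the final pipeline):
df %>% mutate(grp = cumsum(lag(cell == 'a2', default = TRUE)))
# grp runs 1 1 2 2 2 3 3 4 4 4 5 5 6, so each (ID, grp) pair covers
# at most one "a2" and the values leading up to it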
A succinct solution using ave.
r <- transform(df, value=ave(value, ID, FUN=cumsum))[df$cell == "a2", ]
r
# ID cell value
# 2 A a2 7
# 5 B a2 9
# 7 C a2 15
# 10 D a2 9
# 12 D a2 16
An option with data.table
library(data.table)
setDT(df)[, value := cumsum(value) , ID][cell == 'a2']
Output:
# ID cell value
#1: A a2 7
#2: B a2 9
#3: C a2 15
#4: D a2 9
#5: D a2 16
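Note that setDT() converts df in place, so the new column is added to df itself. If you want to leave df untouched, work on a copy (a minimal sketch):
library(data.table)
DT <- as.data.table(df)  # copies; df stays a plain data.frame
DT[, value := cumsum(value), by = ID][cell == 'a2']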
This question already has answers here:
Relative frequencies / proportions with dplyr
(10 answers)
Closed 3 years ago.
I want to do something which appears simple, but I don't have a good feel for R yet; it is a maze of twisty passages, all different.
I have a table with several variables, and I want to group on two variables ... I want a two-level hierarchical grouping, also known as a tree. This can evidently be done using the group_by function of dplyr.
And then I want to compute marginal statistics (in this case, relative frequencies) based on group counts for level 1 and level 2.
In pictures, given this table of 18 rows:
I want this table of 6 rows:
Is there a simple way to do this in dplyr? (I can do it in SQL, but ...)
Edited for example
For example, based on the nycflights13 package:
library(dplyr)
# install.packages("nycflights13")  # run once if needed
library(nycflights13)
data(flights)  # contains information about flights, one flight per row
ff <- flights %>%
mutate(approx_dist = floor((distance + 999)/1000)*1000) %>%
select(carrier, approx_dist) %>%
group_by(carrier, approx_dist) %>%
summarise(n = n()) %>%
arrange(carrier, approx_dist)
This creates a tbl ff with the number of flights for each pair of (carrier, inter-airport-distance-rounded-to-1000s):
# A tibble: 33 x 3
# Groups: carrier [16]
carrier approx_dist n
<chr> <dbl> <int>
1 9E 1000 15740
2 9E 2000 2720
3 AA 1000 9146
4 AA 2000 17210
5 AA 3000 6373
And now I would like to compute the relative frequencies for the "approx_dist" values in each "carrier" group, for example, I would like to get:
carrier approx_dist n rel_freq
<chr> <dbl> <int>
1 9E 1000 15740 15740/(15740+2720)
2 9E 2000 2720 2720/(15740+2720)
If I understood your problem correctly, here is what you can do. This does not solve your exact problem (the data below is made up), but it should give you some hints:
library(dplyr)
d <- data.frame(col1 = rep(c("a", "a", "a", "b", "b", "b"), 2),
                col2 = rep(c("a1", "a2", "a3", "b1", "b2", "b3"), 2),
                stringsAsFactors = FALSE)
d %>%
  group_by(col1) %>%
  mutate(count_g1 = n()) %>%
  ungroup() %>%
  group_by(col1, col2) %>%
  summarise(rel_freq = n() / unique(count_g1)) %>%
  ungroup()
# # A tibble: 6 x 3
# col1 col2 rel_freq
# <chr> <chr> <dbl>
# 1 a a1 0.333
# 2 a a2 0.333
# 3 a a3 0.333
# 4 b b1 0.333
# 5 b b2 0.333
# 6 b b3 0.333
Update: @TimTeaFan suggested re-writing the code above using prop.table:
d %>% group_by(col1, col2) %>% summarise(n = n()) %>% mutate(freq = prop.table(n))
Update: running the same trick on the ff table from the question's example, which has everything set up except the last mutate:
ff %>% mutate(rel_freq = prop.table(n))
# A tibble: 33 x 4
# Groups: carrier [16]
carrier approx_dist n rel_freq
<chr> <dbl> <int> <dbl>
1 9E 1000 15740 0.853
2 9E 2000 2720 0.147
3 AA 1000 9146 0.279
4 AA 2000 17210 0.526
5 AA 3000 6373 0.195
6 AS 3000 714 1
7 B6 1000 24613 0.450
8 B6 2000 22159 0.406
9 B6 3000 7863 0.144
10 DL 1000 20014 0.416
# … with 23 more rows
...or, equivalently, since ff is still grouped by carrier, so sum(n) is computed within each carrier:
ff %>% mutate(rel_freq = n/sum(n))
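As a quick sanity check, the relative frequencies should sum to 1 within each carrier:
ff %>%
  mutate(rel_freq = n / sum(n)) %>%
  summarise(total = sum(rel_freq))  # one row per carrier, each total equal to 1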
Fake data for demonstration:
library(dplyr)
df <- data.frame(stringsAsFactors = FALSE,
                 col1 = rep(c("A", "B"), each = 9),
                 col2 = rep(1:3),
                 value = 1:18)
#> df
# col1 col2 value
#1 A 1 1
#2 A 2 2
#3 A 3 3
#4 A 1 4
#5 A 2 5
#6 A 3 6
#7 A 1 7
#8 A 2 8
#9 A 3 9
#10 B 1 10
#11 B 2 11
#12 B 3 12
#13 B 1 13
#14 B 2 14
#15 B 3 15
#16 B 1 16
#17 B 2 17
#18 B 3 18
Solution
df %>%
group_by(col1, col2) %>%
summarise(col2_ttl = sum(value)) %>% # Count is boring for this data, but you
mutate(share_of_col1 = col2_ttl / sum(col2_ttl)) #... could use `n()` for that
## A tibble: 6 x 4
## Groups: col1 [2]
# col1 col2 col2_ttl share_of_col1
# <chr> <int> <int> <dbl>
#1 A 1 12 0.267
#2 A 2 15 0.333
#3 A 3 18 0.4
#4 B 1 39 0.310
#5 B 2 42 0.333
#6 B 3 45 0.357
First we group by both columns. Here the ordering makes a difference, because the groups are created hierarchically, and each summary we run collapses the last layer of grouping. So the summarise line (or summarize; it was written with the UK spelling, but the US spelling is an alias) sums up the values in each col1-col2 combination, leaving a residual grouping by col1 which we can use in the next line. (Try running the pipe only through the summarise step to see what is produced at that stage.)
In the last line, col2_ttl is divided by the sum of all the col2_ttl values in its group, i.e. the total across each col1.
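The same computation can be spelled with count(), which is shorthand for group_by() plus summarise() (a sketch assuming a dplyr version that supports the wt and name arguments):
library(dplyr)
df %>%
  count(col1, col2, wt = value, name = "col2_ttl") %>%  # sums value per col1-col2 pair
  group_by(col1) %>%                                    # count() returns ungrouped, so regroup
  mutate(share_of_col1 = col2_ttl / sum(col2_ttl)) %>%
  ungroup()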
I have a df like this
name <- c("Fred","Mark","Jen","Simon","Ed")
a_or_b <- c("a","a","b","a","b")
abc_ah_one <- c(3,5,2,4,7)
abc_bh_one <- c(5,4,1,9,8)
abc_ah_two <- c(2,1,3,7,6)
abc_bh_two <- c(3,6,8,8,5)
abc_ah_three <- c(5,4,7,6,2)
abc_bh_three <- c(9,7,2,1,4)
def_ah_one <- c(1,3,9,2,7)
def_bh_one <- c(2,8,4,6,1)
def_ah_two <- c(4,7,3,2,5)
def_bh_two <- c(5,2,9,8,3)
def_ah_three <- c(8,5,3,5,2)
def_bh_three <- c(2,7,4,3,0)
df <- data.frame(name,a_or_b,abc_ah_one,abc_bh_one,abc_ah_two,abc_bh_two,
abc_ah_three,abc_bh_three,def_ah_one,def_bh_one,
def_ah_two,def_bh_two,def_ah_three,def_bh_three)
I want to use the value in column "a_or_b" to choose the values in each of the corresponding "ah/bh" columns for each "abc" (one, two, and three), and put it into a new data frame. For example, Fred would have the values 3, 2 and 5 in his row in the new df. Those values represent the values of each of his "ah" categories for the abc columns. Jen, who has "b" in her a_or_b column, would have all of her "bh" values from her abc columns for her row in the new df. Here is what my desired output would look like:
combo_one <- c(3,5,1,4,8)
combo_two <- c(2,1,8,7,5)
combo_three <- c(5,4,2,6,4)
df2 <- data.frame(name,a_or_b,combo_one,combo_two,combo_three)
I've attempted this using sapply. The following gives me a matrix of the correct column indexes of df[grep("abc", colnames(df), fixed = TRUE)] for each row:
sapply(paste0(df$a_or_b, "h"), grep, colnames(df[grep("abc", colnames(df), fixed = TRUE)]))
First we gather your data into a tidy long format, then break the key column out into something useful. After that the filtering is simple, and if necessary we can convert back to a wide format at the end:
library(dplyr)
library(tidyr)
gather(df, key = "var", value = "val", -name, -a_or_b) %>%
separate(var, into = c("combo", "h", "ind"), sep = "_") %>%
mutate(h = substr(h, 1, 1)) %>%
filter(a_or_b == h, combo == "abc") %>%
arrange(name) -> result_long
result_long
# name a_or_b combo h ind val
# 1 Ed b abc b one 8
# 2 Ed b abc b two 5
# 3 Ed b abc b three 4
# 4 Fred a abc a one 3
# 5 Fred a abc a two 2
# 6 Fred a abc a three 5
# 7 Jen b abc b one 1
# 8 Jen b abc b two 8
# 9 Jen b abc b three 2
# 10 Mark a abc a one 5
# 11 Mark a abc a two 1
# 12 Mark a abc a three 4
# 13 Simon a abc a one 4
# 14 Simon a abc a two 7
# 15 Simon a abc a three 6
spread(result_long, key = ind, value = val) %>%
select(name, a_or_b, one, two, three)
# name a_or_b one two three
# 1 Ed b 8 5 4
# 2 Fred a 3 2 5
# 3 Jen b 1 8 2
# 4 Mark a 5 1 4
# 5 Simon a 4 7 6
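gather() and spread() are superseded; with a recent tidyr the same reshape can be written with the pivot functions (a sketch, same assumptions as above):
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(-c(name, a_or_b),
               names_to = c("combo", "h", "ind"), names_sep = "_") %>%
  mutate(h = substr(h, 1, 1)) %>%
  filter(a_or_b == h, combo == "abc") %>%
  pivot_wider(names_from = ind, values_from = value) %>%
  select(name, a_or_b, one, two, three)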
A base R approach would use lapply: we loop through each row of the dataframe, build a pattern with paste0 based on the a_or_b column to find the matching columns, and then rbind all the values together, one row per person.
new_df <- do.call("rbind", lapply(seq(nrow(df)), function(x)
setNames(df[x, grepl(paste0("abc_",df[x,"a_or_b"], "h"), colnames(df))],
c("combo_one", "combo_two", "combo_three"))))
new_df
# combo_one combo_two combo_three
#1 3 2 5
#2 5 1 4
#3 1 8 2
#4 4 7 6
#5 8 5 4
We can cbind the required columns then :
cbind(df[c(1, 2)], new_df)
# name a_or_b combo_one combo_two combo_three
#1 Fred a 3 2 5
#2 Mark a 5 1 4
#3 Jen b 1 8 2
#4 Simon a 4 7 6
#5 Ed b 8 5 4
It's possible to do this with mutate() by building each target column name with paste0() and looking the value up row by row with get():
library(dplyr)
df %>%
  select(name, a_or_b, starts_with("abc")) %>%
  rename_if(is.numeric, ~ sub("abc_", "", .)) %>%  # ah_one, bh_one, ...
  rowwise() %>%
  mutate(combo_one   = get(paste0(a_or_b, "h_one")),
         combo_two   = get(paste0(a_or_b, "h_two")),
         combo_three = get(paste0(a_or_b, "h_three"))) %>%
  ungroup() %>%
  select(name, a_or_b, starts_with("combo"))
Output:
name a_or_b combo_one combo_two combo_three
1 Fred a 3 2 5
2 Mark a 5 1 4
3 Jen b 1 8 2
4 Simon a 4 7 6
5 Ed b 8 5 4
This question already has answers here:
Calculate group mean, sum, or other summary stats. and assign column to original data
(4 answers)
Closed 5 years ago.
df <- data.frame(
id = c('A1','A2','A4','A2','A1','A4','A3','A2','A1','A3'),
value = c(4,3,1,3,4,6,6,1,8,4))
I want to get the max value within each id group. I tried the following but got an error saying "replacement has 4 rows, data has 10", which I understand, but I don't know how to correct it:
df$max.by.id <- aggregate(value ~ id, df, max)
This is how I ended up doing it successfully:
max.by.id <- aggregate(value ~ id, df, max)
names(max.by.id) <- c("id", "max")
df2 <- merge(df,max.by.id, by.x = "id", by.y = "id")
df2
# id value max
#1 A1 4 8
#2 A1 4 8
#3 A1 8 8
#4 A2 3 3
#5 A2 3 3
#6 A2 1 3
#7 A3 6 6
#8 A3 4 6
#9 A4 1 6
#10 A4 6 6
Any better way? Thanks in advance!
ave() is the function for that task:
df$max.by.id <- ave(df$value, df$id, FUN=max)
example:
df <- data.frame(
id = c('A1','A2','A4','A2','A1','A4','A3','A2','A1','A3'),
value = c(4,3,1,3,4,6,6,1,8,4))
df$max.by.id <- ave(df$value, df$id, FUN=max)
The result of ave() has the same length as the original vector of values (which is also the length of the grouping variable). The values of the result are placed at the right positions with respect to the grouping variable. For more information, read the documentation of ave().
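For the example above, the result lines up row by row:
df
#    id value max.by.id
# 1  A1     4         8
# 2  A2     3         3
# 3  A4     1         6
# 4  A2     3         3
# 5  A1     4         8
# 6  A4     6         6
# 7  A3     6         6
# 8  A2     1         3
# 9  A1     8         8
# 10 A3     4         6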
With data.table, you can compute the max by id "inside" the data, automatically adding the newly computed column (one value per id, repeated for each row):
library(data.table)
setDT(df)[, max.by.id := max(value), by=id]
df
# id value max.by.id
# 1: A1 4 8
# 2: A2 3 3
# 3: A4 1 6
# 4: A2 3 3
# 5: A1 4 8
# 6: A4 6 6
# 7: A3 6 6
# 8: A2 1 3
# 9: A1 8 8
#10: A3 4 6
tapply(df$value, df$id, max)
# A1 A2 A3 A4
#  8  3  6  6
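tapply() gives one value per group as a named vector; to attach the result back to every row, index it by id (equivalent to the ave() result above):
grp_max <- tapply(df$value, df$id, max)
df$max.by.id <- grp_max[as.character(df$id)]  # repeats each group max for every row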
library(plyr)
ddply(df, .(id), function(df){max(df$value)})
# id V1
# 1 A1 8
# 2 A2 3
# 3 A3 6
# 4 A4 6
library(dplyr)
df %>% group_by(id) %>% arrange(desc(value)) %>% do(head(., 1))
# Source: local data frame [4 x 2]
# Groups: id [4]
# id value
# (fctr) (dbl)
# 1 A1 8
# 2 A2 3
# 3 A3 6
# 4 A4 6
UPDATE:
If you need to keep the raw values, use the following code.
library(plyr)
ddply(df, .(id), function(df){
df$max.val = max(df$value)
return(df)
})
library(dplyr)
df %>% group_by(id) %>% mutate(max.val=max(value))
# Source: local data frame [10 x 3]
# Groups: id [4]
# id value max.val
# (fctr) (dbl) (dbl)
# 1 A1 4 8
# 2 A2 3 3
# 3 A4 1 6
# 4 A2 3 3
# 5 A1 4 8
# 6 A4 6 6
# 7 A3 6 6
# 8 A2 1 3
# 9 A1 8 8
# 10 A3 4 6