I have a sample dataset as below:
Day<-c(1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2)
Group<-c("A","A","A","B","B","B","C","C","C","A","A","A","A","B","B","B","C","C","C")
Rain<-c(4,4,6,5,3,4,5,5,3,6,6,6,5,3,3,3,2,5,2)
UV<-c(6,6,7,8,5,6,5,6,6,6,7,7,8,8,5,6,8,5,7)
dat<-data.frame(Day,Group,Rain,UV)
I want to run a Kruskal Wallis test among 'A','B' and 'C' in "Group" for the variables "Rain" and "UV".
At present, I am subsetting the variables one by one for Kruskal test as below:
library(rstatix)
library(tidyverse)
dat_Rain<-dat%>%select(c(Day,Group,Rain))
dat_Rain%>%
group_by(Day) %>%
kruskal_test(Rain ~ Group)
How do I repeat the Kruskal test for multiple variables (Rain, UV) in this dataset? Thanks.
You can define the columns that you want to apply kruskal_test to and use map_df to collect all the results in one data frame.
library(rstatix)
library(tidyverse)
cols <- c('Rain', 'UV')
map_df(cols, ~dat %>% group_by(Day) %>% kruskal_test(reformulate('Group', .x)))
# Day .y. n statistic df p method
# <dbl> <chr> <int> <dbl> <int> <dbl> <chr>
#1 1 Rain 9 0.505 2 0.777 Kruskal-Wallis
#2 2 Rain 10 6.52 2 0.0384 Kruskal-Wallis
#3 1 UV 9 1.16 2 0.56 Kruskal-Wallis
#4 2 UV 10 0.423 2 0.809 Kruskal-Wallis
Using lapply and a helper function, this could be achieved as shown below.
I also used bind_rows to bind the resulting list into one data frame.
Day<-c(1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2)
Group<-c("A","A","A","B","B","B","C","C","C","A","A","A","A","B","B","B","C","C","C")
Rain<-c(4,4,6,5,3,4,5,5,3,6,6,6,5,3,3,3,2,5,2)
UV<-c(6,6,7,8,5,6,5,6,6,6,7,7,8,8,5,6,8,5,7)
dat<-data.frame(Day,Group,Rain,UV)
library(rstatix)
library(tidyverse)
kt <- function(x, data) {
fmla <- as.formula(paste(x, "~ Group"))
data %>%
group_by(Day) %>%
kruskal_test(fmla)
}
lapply(c("Rain", "UV"), kt, data = dat) %>%
bind_rows()
#> # A tibble: 4 x 7
#> Day .y. n statistic df p method
#> <dbl> <chr> <int> <dbl> <int> <dbl> <chr>
#> 1 1 Rain 9 0.505 2 0.777 Kruskal-Wallis
#> 2 2 Rain 10 6.52 2 0.0384 Kruskal-Wallis
#> 3 1 UV 9 1.16 2 0.56 Kruskal-Wallis
#> 4 2 UV 10 0.423 2 0.809 Kruskal-Wallis
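Another option, sketched here under the assumption that only the p-values are needed, is to reshape the data to long format and group by both Day and the variable name. It uses base R's kruskal.test() rather than rstatix's kruskal_test(); the names dat_long, variable, value and kw_p are made up for this illustration.

```r
library(dplyr)
library(tidyr)

Day   <- c(1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2)
Group <- c("A","A","A","B","B","B","C","C","C","A","A","A","A","B","B","B","C","C","C")
Rain  <- c(4,4,6,5,3,4,5,5,3,6,6,6,5,3,3,3,2,5,2)
UV    <- c(6,6,7,8,5,6,5,6,6,6,7,7,8,8,5,6,8,5,7)
dat   <- data.frame(Day, Group, Rain, UV)

# Long format: one row per observation and variable
dat_long <- dat %>%
  pivot_longer(c(Rain, UV), names_to = "variable", values_to = "value")

# One Kruskal-Wallis test per Day and variable (base R kruskal.test)
kw_p <- dat_long %>%
  group_by(Day, variable) %>%
  summarise(p = kruskal.test(x = value, g = Group)$p.value, .groups = "drop")

kw_p
```

Adding another variable then only means adding its name to the pivot_longer() call.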
Related
I am working on a large dataset of offspring sex ratio from +36,000 individuals of over 1,000 species. I want to see if the median sex ratio of each species significantly differs from .5. I am using a one-sample wilcoxon to do this. Here is an example dataset:
n<-100
dat<-data.frame(species=rep(LETTERS[1:5],n/5), SR=sample((1:100)/100,n,replace=TRUE))
When I run the following code, I get results where all p-values are the same.
library(dplyr)
res <- dat %>% group_by(species) %>%
do(w=wilcox.test(dat$SR,mu=.5,alternative=("two.sided"))) %>%
summarize(species,wilcox=w$p.value)
res
#OUTPUT#
# # A tibble: 5 x 2
species wilcox
<chr> <dbl>
1 A 0.465
2 B 0.465
3 C 0.465
4 D 0.465
5 E 0.465
Any idea what I'm doing wrong and how I can fix this?
All p-values are the same because wilcox.test(dat$SR, ...) refers to the full SR column of dat rather than to the rows of the current group, so every group runs the same test. In addition, the function do() is superseded and should not be used anymore. You can do the same within summarize() with across().
First you group by species, then you use across() within summarize() to access the values for each group, calculate the Wilcoxon test and directly extract its p-value with $p.value at the end of the expression.
Note that I set exact = FALSE to skip the exact p-value calculation: with fewer than 50 observations per group, wilcox.test() attempts an exact p-value by default and warns when the data contain ties or values equal to mu. For your real data you can drop this argument if your groups are larger. For more information see ?wilcox.test.
n<-100
dat<-data.frame(species=rep(LETTERS[1:5],n/5), SR=sample((1:100)/100,n,replace=TRUE))
library(dplyr)
dat %>%
group_by(species) %>%
summarize(wilcox = across(SR,
~wilcox.test(.,
mu=.5,
alternative=("two.sided"),
exact = FALSE)$p.value)$SR)
#> # A tibble: 5 × 2
#> species wilcox$SR
#> <chr> <dbl>
#> 1 A 0.737
#> 2 B 0.0105
#> 3 C 0.751
#> 4 D 0.380
#> 5 E 0.614
Created on 2022-08-19 with reprex v2.0.2
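Since only one column is tested, across() is not strictly needed. A minimal sketch, assuming the same simulated data (the seed is added here only for reproducibility), refers to the bare column name SR inside summarize() so that each group's own values are used:

```r
library(dplyr)

set.seed(1)  # assumption: added only so the sketch is reproducible
n <- 100
dat <- data.frame(species = rep(LETTERS[1:5], n/5),
                  SR = sample((1:100)/100, n, replace = TRUE))

res <- dat %>%
  group_by(species) %>%
  # SR (not dat$SR) is evaluated within the current group
  summarize(wilcox = wilcox.test(SR, mu = 0.5,
                                 alternative = "two.sided",
                                 exact = FALSE)$p.value)

res
```

The result has one row per species and a plain numeric column named wilcox.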
I'm trying to get correlation matrices of an arbitrary number of factors by group, ideally using dplyr. I have no problem getting the correlation matrix by filtering by group and summarizing, but using a "group_by", I'm not sure how to pass the factor data to cor.
library(dplyr)
numRows <- 20
myData <- tibble(A = rnorm(numRows),
B = rnorm(numRows),
C = rnorm(numRows),
Group = c(rep("Group1", numRows/2), rep("Group2", numRows/2)))
# Essentially what I'm doing is trying to get these matrices, but for all groups
myData %>%
filter(Group == "Group1") %>%
select(-Group) %>%
summarize(CorMat = cor(.))
# However, I don't know what to pass into "cor". The code below fails
myData %>%
group_by(Group) %>%
summarize(CorMat = cor(.))
# Error looks like this
Error: Problem with `summarise()` column `CorMat`.
i `CorMat = cor(.)`.
x 'x' must be numeric
i The error occurred in group 1: Group = "Group1".
I've seen solutions for the grouped correlation between specific factors (Correlation matrix by group) or correlations between all factors to a specific factor (Correlation matrix of grouped variables in dplyr), but nothing for a grouped correlation matrix of all factors to all factors.
You can try using nest_by, which will put your data (without Group) into a list column called data. Then you can pass this column to cor:
myData %>%
nest_by(Group) %>%
summarise(CorMat = cor(data))
Output
Group CorMat[,1] [,2] [,3]
<chr> <dbl> <dbl> <dbl>
1 Group1 1 -0.132 0.638
2 Group1 -0.132 1 -0.284
3 Group1 0.638 -0.284 1
4 Group2 1 0.429 -0.228
5 Group2 0.429 1 -0.235
6 Group2 -0.228 -0.235 1
If you want a named list of matrices, you can also try the following. You can add split (or try group_split without names) and then map to remove the Group column.
library(tidyverse)
myData %>%
nest_by(Group) %>%
summarise(CorMat = cor(data)) %>%
ungroup %>%
split(f = .$Group) %>%
map(~ .x %>% select(-Group))
Output
$Group1
# A tibble: 3 x 1
CorMat[,1] [,2] [,3]
<dbl> <dbl> <dbl>
1 1 -0.132 0.638
2 -0.132 1 -0.284
3 0.638 -0.284 1
$Group2
# A tibble: 3 x 1
CorMat[,1] [,2] [,3]
<dbl> <dbl> <dbl>
1 1 0.429 -0.228
2 0.429 1 -0.235
3 -0.228 -0.235 1
I am using WRS2 to carry out robust pairwise comparisons. But one problem is that it removes the group level names from the output dataframes and saves it in a different object.
# setup
set.seed(123)
library(WRS2)
library(tidyverse)
# robust pairwise comparisons
x <- lincon(libido ~ dose, data = viagra, tr = 0.1)
# comparisons
x$comp
#> Group Group psihat ci.lower ci.upper p.value
#> [1,] 1 2 -1.0 -3.440879 1.44087853 0.25984505
#> [2,] 1 3 -2.8 -5.536161 -0.06383861 0.04914871
#> [3,] 2 3 -1.8 -4.536161 0.93616139 0.17288911
# vector with group level names
x$fnames
#> [1] "placebo" "low" "high"
I can convert it to a tibble:
# converting to tibble
suppressMessages(as_tibble(x$comp, .name_repair = "unique")) %>%
dplyr::rename(group1 = Group...1, group2 = Group...2)
#> # A tibble: 3 x 6
#> group1 group2 psihat ci.lower ci.upper p.value
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 -1 -3.44 1.44 0.260
#> 2 1 3 -2.8 -5.54 -0.0638 0.0491
#> 3 2 3 -1.8 -4.54 0.936 0.173
I would then like to replace the group column numeric values with actual names included in fnames (so map fnames[1] -> 1, fnames[2] -> 2, and so on).
So the final data frame should look something like the following:
#> # A tibble: 3 x 6
#> group1 group2 psihat ci.lower ci.upper p.value
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 placebo low -1 -3.44 1.44 0.260
#> 2 placebo high -2.8 -5.54 -0.0638 0.0491
#> 3 low high -1.8 -4.54 0.936 0.173
In this case, it was easy to just copy-paste the three values, but I want to have a generalizable approach where no matter the number of levels, it works. How can I do this using dplyr?
Use a named vector as a lookup table with tidyverse. This matches by value rather than by position, so it still works if the values in the 'Group' columns are not a consecutive sequence or are stored as character.
library(dplyr)
as_tibble(x$comp, .name_repair = 'unique') %>%
mutate(across(starts_with("Group"),
~ setNames(x$fnames, seq_along(x$fnames))[as.character(.)]))
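To see the lookup in isolation, here is a small sketch that does not need WRS2. The matrix comp and the vector fnames are hypothetical stand-ins for x$comp and x$fnames, reduced to three columns, and lookup is a made-up name.

```r
library(dplyr)

# Hypothetical stand-ins for x$comp and x$fnames
comp <- matrix(c(1, 1, 2,
                 2, 3, 3,
                 -1.0, -2.8, -1.8),
               ncol = 3,
               dimnames = list(NULL, c("Group", "Group", "psihat")))
fnames <- c("placebo", "low", "high")

# Named vector: names are the numeric codes, values are the labels
lookup <- setNames(fnames, seq_along(fnames))

res <- as_tibble(comp, .name_repair = "unique") %>%
  mutate(across(starts_with("Group"),
                ~ unname(lookup[as.character(.x)])))

res
```

Because the lookup goes through the names of the vector, it does not depend on how many levels there are.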
Does this fulfil your needs?
names <- c("A","B","C")
df = data.frame(group=c(1,2,3))
library(dplyr)
df %>% mutate(group = names[group])
group
1 A
2 B
3 C
Here's an approach using the recode function, with the recoding vector built programmatically from the data:
# Setup
set.seed(123)
library(WRS2)
library(tidyverse)
x <- lincon(libido ~ dose, data = viagra, tr = 0.1)
# Create recoding vector
recode.vec = x$fnames %>% set_names(1:length(x$fnames))
# Recode columns
x.comp = x$comp %>%
as_tibble(.name_repair=make.unique) %>%
mutate(across(starts_with("Group"), ~recode(., !!!recode.vec)))
Output:
x.comp
#> # A tibble: 3 x 6
#> Group Group.1 psihat ci.lower ci.upper p.value
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 placebo low -1 -3.44 1.44 0.260
#> 2 placebo high -2.8 -5.54 -0.0638 0.0491
#> 3 low high -1.8 -4.54 0.936 0.173
Try this tidyverse approach, which reshapes the data to long format after extracting the objects as tibbles. You can then use left_join() to attach the group names. Here is the code to get something close to what you want:
# setup
set.seed(123)
library(WRS2)
library(tidyverse)
# robust pairwise comparisons
x <- lincon(libido ~ dose, data = viagra, tr = 0.1)
#Transform to tibble
df1 <- suppressMessages(as_tibble(x$comp, .name_repair = "unique")) %>%
dplyr::rename(group1 = Group...1, group2 = Group...2)
#Extract labels
df2 <- tibble(treat=x$fnames) %>% mutate(value=1:n())
#Format to long df1
df1 <- df1 %>%
mutate(id=1:n()) %>%
pivot_longer(cols = c(group1,group2)) %>%
rename(group=name) %>% left_join(df2) %>% select(-value) %>%
pivot_wider(names_from = group,values_from=treat) %>% select(-id)
Output:
# A tibble: 3 x 6
psihat ci.lower ci.upper p.value group1 group2
<dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 -1 -3.44 1.44 0.260 placebo low
2 -2.8 -5.54 -0.0638 0.0491 placebo high
3 -1.8 -4.54 0.936 0.173 low high
I wish to estimate models in one dataframe, but the formula for each model has some "moving parts" which come from another dataframe. For example, say I wish to estimate the following model (I can't post picture and found no way to type latex equations):
mpg = a + b*log(w_1 * drat + w_2 * hp)
where w_1 and w_2 are weights, which for example are either 0.5 or 1. I use expand.grid() to create a dataframe of weights, then mutate() a formula using paste() or paste0() with the variable names and the value of the weights, and then pass it to the lm() function.
However, the model estimated is just using the formula found in the first row of the weights dataframe. This is solved if I use group_by() before estimating the models.
The question is: why? Why doesn't the first code work? What does group_by() achieve here that makes it possible?
library(tidyverse)
cars <- mtcars
w <- seq(from=0.5, to=1, by=0.5)
weights <- as_tibble(expand.grid(w1=w,w2=w))
#Doesn't work - the lm model is fit using the formula from the first row only
weights %>%
mutate(formula_weights = paste0("mpg~log(",w1,"*drat+",w2,"*hp)")) %>%
mutate(r2 = summary(lm(data=cars, formula = formula_weights))$r.squared)
#Does work - model is fit using the w1 and w2 values from each row (formula_weights)
weights %>%
mutate(formula_weights = paste0("mpg~log(",w1,"*drat+",w2,"*hp)")) %>%
group_by(formula_weights) %>%
mutate(r2 = summary(lm(data=cars, formula = formula_weights))$r.squared)
The output without group_by():
# A tibble: 4 x 4
w1 w2 formula_weights r2
<dbl> <dbl> <chr> <dbl>
1 0.5 0.5 mpg~log(0.5*drat+0.5*hp) 0.715
2 1 0.5 mpg~log(1*drat+0.5*hp) 0.715
3 0.5 1 mpg~log(0.5*drat+1*hp) 0.715
4 1 1 mpg~log(1*drat+1*hp) 0.715
The output with group_by():
# A tibble: 4 x 4
# Groups: formula_weights [4]
w1 w2 formula_weights r2
<dbl> <dbl> <chr> <dbl>
1 0.5 0.5 mpg~log(0.5*drat+0.5*hp) 0.715
2 1 0.5 mpg~log(1*drat+0.5*hp) 0.709
3 0.5 1 mpg~log(0.5*drat+1*hp) 0.718
4 1 1 mpg~log(1*drat+1*hp) 0.715
We can add rowwise(), so that each row's formula is passed to lm() on its own
library(dplyr)
weights %>%
mutate(formula_weights = paste0("mpg~log(",w1,"*drat+",w2,"*hp)")) %>%
rowwise() %>%
mutate(r2 = summary(lm(data=cars, formula = formula_weights))$r.squared)
#Source: local data frame [4 x 4]
#Groups: <by row>
# A tibble: 4 x 4
# w1 w2 formula_weights r2
# <dbl> <dbl> <chr> <dbl>
#1 0.5 0.5 mpg~log(0.5*drat+0.5*hp) 0.715
#2 1 0.5 mpg~log(1*drat+0.5*hp) 0.709
#3 0.5 1 mpg~log(0.5*drat+1*hp) 0.718
#4 1 1 mpg~log(1*drat+1*hp) 0.715
Or use map
library(purrr)
weights %>%
mutate(r2 = map_dbl(paste0("mpg~log(",w1,"*drat+",w2,"*hp)"), ~
summary(lm(data = cars, formula = .x))$r.squared))
# A tibble: 4 x 3
# w1 w2 r2
# <dbl> <dbl> <dbl>
#1 0.5 0.5 0.715
#2 1 0.5 0.709
#3 0.5 1 0.718
#4 1 1 0.715
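As for why grouping helps: mutate() evaluates its expression once per group. Without grouping there is a single group, so lm() is handed the whole formula_weights column and, as the question's output shows, only the first formula ends up being fitted. group_by(formula_weights) and rowwise() both create one group per row here. A small sketch that makes this visible by counting how many formulas the expression sees (n_seen is a made-up column name):

```r
library(dplyr)

w <- seq(from = 0.5, to = 1, by = 0.5)
weights <- as_tibble(expand.grid(w1 = w, w2 = w)) %>%
  mutate(formula_weights = paste0("mpg~log(", w1, "*drat+", w2, "*hp)"))

# Ungrouped: the expression sees the whole column (all 4 formulas at once)
ungrouped <- weights %>%
  mutate(n_seen = length(formula_weights))

# Row-wise: the expression is evaluated once per row, on a single formula
by_row <- weights %>%
  rowwise() %>%
  mutate(n_seen = length(formula_weights)) %>%
  ungroup()

ungrouped$n_seen  # 4 in every row
by_row$n_seen     # 1 in every row
```

map_dbl() and sapply() reach the same result by looping over the formulas explicitly instead of relying on grouping.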
Use sapply inside your mutate; summary() and lm() are not vectorized.
weights %>%
mutate(formula_weights = paste0("mpg~log(",w1,"*drat+",w2,"*hp)")) %>%
mutate(r2 = sapply(formula_weights,
function(fw) summary(lm(data=cars, formula = fw))$r.squared))
I have this simple dataframe. The sum column represents the sum of the row. I would like to use prop.test to determine the P-value for each column, and present that data as an additional row labeled p-value. I can use prop.test in the following way to determine a p value for any individual column, but cannot work out how to apply that to multiple columns with a single function.
      Other Island N_Shelf N_Shore S_Shore Sum
Type1    10      4       1       0       3  18
Type2    19     45       1       9      11  85
This will output a p-value for the island column
ResI2<- prop.test(x=TableAvE_Island$Island, n=TableAvE_Island$Sum)
output:
data: TableAvE_Island$Island out of TableAvE_Island$Sum
X-squared = 4.456, df = 1, p-value = 0.03478
alternative hypothesis: two.sided
95 percent confidence interval:
-0.56027107 -0.05410802
sample estimates:
prop 1 prop 2
0.2222222 0.5294118
I've tried to use the apply command but cannot work out its usage, and the examples I've been able to find don't seem similar enough. Any pointers would be appreciated.
Here's a look with broom's function tidy, which takes output from tests and other operations and formats them as "tidy" data frames.
For the first prop.test that you posted, the tidy output looks like this:
library(tidyverse)
broom::tidy(prop.test(TableAvE_Island$Island, TableAvE_Island$Sum))
#> estimate1 estimate2 statistic p.value parameter conf.low
#> 1 0.2222222 0.5294118 4.456017 0.03477849 1 -0.5602711
#> conf.high
#> 1 -0.05410802
#> method
#> 1 2-sample test for equality of proportions with continuity correction
#> alternative
#> 1 two.sided
To do this for all the variables in your data frame vs Sum, I gathered it into a long shape
table_long <- gather(TableAvE_Island, key = variable, value = val, -Sum)
head(table_long)
#> # A tibble: 6 x 3
#> Sum variable val
#> <int> <chr> <int>
#> 1 18 Other 10
#> 2 85 Other 19
#> 3 18 Island 4
#> 4 85 Island 45
#> 5 18 N_Shelf 1
#> 6 85 N_Shelf 1
Then I grouped the long-shaped data by variable and piped it into do, which allows you to call a function on each of the groups in a data frame, using . as a stand-in for the subset of the data. Then I called tidy on the column containing the nested results of the prop.test. This gives you a data frame of all the relevant results of the test, with one row for each of "Island", "N_Shelf", etc.
table_long %>%
group_by(variable) %>%
do(test = prop.test(x = .$val, n = .$Sum)) %>%
broom::tidy(test)
#> # A tibble: 5 x 10
#> # Groups: variable [5]
#> variable estimate1 estimate2 statistic p.value parameter conf.low
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Island 0.222 0.529 4.46 0.0348 1 -0.560
#> 2 N_Shelf 0.0556 0.0118 0.0801 0.777 1 -0.0981
#> 3 N_Shore 0 0.106 0.972 0.324 1 -0.205
#> 4 Other 0.556 0.224 6.54 0.0106 1 0.0523
#> 5 S_Shore 0.167 0.129 0.00163 0.968 1 -0.183
#> # ... with 3 more variables: conf.high <dbl>, method <fct>,
#> # alternative <fct>
Created on 2018-05-10 by the reprex package (v0.2.0).
We could gather into 'long' format and then store it as a list column
library(tidyverse)
res <- gather(TableAvE_Island, key, val, -Sum) %>%
group_by(key) %>%
nest() %>%
mutate(out = map(data, ~prop.test(.x$val, .x$Sum)))
res$out
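For newer package versions, a sketch that avoids both do() and gather() may be useful: tidyr now recommends pivot_longer() over gather(), and as far as I know later broom releases no longer tidy the output of do() directly. The data frame below is typed in from the table in the question, and the names p_values, variable and val are only illustrative.

```r
library(dplyr)
library(tidyr)

# Data typed in from the table in the question
TableAvE_Island <- data.frame(
  Other   = c(10, 19),
  Island  = c(4, 45),
  N_Shelf = c(1, 1),
  N_Shore = c(0, 9),
  S_Shore = c(3, 11),
  Sum     = c(18, 85),
  row.names = c("Type1", "Type2")
)

# One prop.test per original column, each against the row sums
p_values <- TableAvE_Island %>%
  pivot_longer(-Sum, names_to = "variable", values_to = "val") %>%
  group_by(variable) %>%
  summarise(p.value = prop.test(x = val, n = Sum)$p.value)

p_values
```

prop.test() may warn that the chi-squared approximation is unreliable for the columns with very small counts; the p-values are still returned.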