I am using WRS2 to carry out robust pairwise comparisons. One problem is that it drops the group level names from the comparison output and stores them in a separate element of the returned object.
# setup
set.seed(123)
library(WRS2)
library(tidyverse)
# robust pairwise comparisons
x <- lincon(libido ~ dose, data = viagra, tr = 0.1)
# comparisons
x$comp
#> Group Group psihat ci.lower ci.upper p.value
#> [1,] 1 2 -1.0 -3.440879 1.44087853 0.25984505
#> [2,] 1 3 -2.8 -5.536161 -0.06383861 0.04914871
#> [3,] 2 3 -1.8 -4.536161 0.93616139 0.17288911
# vector with group level names
x$fnames
#> [1] "placebo" "low" "high"
I can convert it to a tibble:
# converting to tibble
suppressMessages(as_tibble(x$comp, .name_repair = "unique")) %>%
dplyr::rename(group1 = Group...1, group2 = Group...2)
#> # A tibble: 3 x 6
#> group1 group2 psihat ci.lower ci.upper p.value
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 -1 -3.44 1.44 0.260
#> 2 1 3 -2.8 -5.54 -0.0638 0.0491
#> 3 2 3 -1.8 -4.54 0.936 0.173
I would then like to replace the group column numeric values with actual names included in fnames (so map fnames[1] -> 1, fnames[2] -> 2, and so on).
So the final dataframe should look something like the following:
#> # A tibble: 3 x 6
#> group1 group2 psihat ci.lower ci.upper p.value
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 placebo low -1 -3.44 1.44 0.260
#> 2 placebo high -2.8 -5.54 -0.0638 0.0491
#> 3 low high -1.8 -4.54 0.936 0.173
In this case, it was easy to just copy-paste the three values, but I want to have a generalizable approach where no matter the number of levels, it works. How can I do this using dplyr?
Use a named vector as a lookup table with tidyverse. This matches by value rather than by position, so it still works if the values in the 'Group' columns are not sequential or are stored as character.
library(dplyr)
as_tibble(x$comp, .name_repair = 'unique') %>%
mutate(across(starts_with("Group"),
~ setNames(x$fnames, seq_along(x$fnames))[as.character(.)]))
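The lookup idea can be tried without WRS2 installed. Below is a minimal sketch; `comp` and `fnames` are hypothetical stand-ins for `x$comp` and `x$fnames`, with the two group columns already renamed to `group1` and `group2`.

```r
library(dplyr)

# Hypothetical stand-ins for x$comp and x$fnames (not real WRS2 output)
comp <- tibble(
  group1 = c(1, 1, 2),
  group2 = c(2, 3, 3),
  psihat = c(-1, -2.8, -1.8)
)
fnames <- c("placebo", "low", "high")

# Lookup table: names are the numeric codes (as strings), values are the labels
lookup <- setNames(fnames, seq_along(fnames))

# Index the lookup by the code converted to character; unname() drops the
# leftover names so the new columns are plain character vectors
out <- comp %>%
  mutate(across(starts_with("group"), ~ unname(lookup[as.character(.x)])))

out
```

Because the lookup is indexed by name, the number of levels does not matter as long as every code in the group columns has an entry in `lookup`.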
Does this fulfil your needs?
names <- c("A","B","C")
df = data.frame(group=c(1,2,3))
library(dplyr)
df %>% mutate(group = names[group])
group
1 A
2 B
3 C
Here's an approach using the recode function, with the recoding vector built programmatically from the data:
# Setup
set.seed(123)
library(WRS2)
library(tidyverse)
x <- lincon(libido ~ dose, data = viagra, tr = 0.1)
# Create recoding vector
recode.vec = x$fnames %>% set_names(1:length(x$fnames))
# Recode columns
x.comp = x$comp %>%
as_tibble(.name_repair=make.unique) %>%
mutate(across(starts_with("Group"), ~recode(., !!!recode.vec)))
Output:
x.comp
#> # A tibble: 3 x 6
#> Group Group.1 psihat ci.lower ci.upper p.value
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 placebo low -1 -3.44 1.44 0.260
#> 2 placebo high -2.8 -5.54 -0.0638 0.0491
#> 3 low high -1.8 -4.54 0.936 0.173
Try this tidyverse approach: extract the objects as tibbles, reshape the comparisons to long format, and use left_join() to attach the group names. Here is the code to get something close to what you want:
# setup
set.seed(123)
library(WRS2)
library(tidyverse)
# robust pairwise comparisons
x <- lincon(libido ~ dose, data = viagra, tr = 0.1)
#Transform to tibble
df1 <- suppressMessages(as_tibble(x$comp, .name_repair = "unique")) %>%
dplyr::rename(group1 = Group...1, group2 = Group...2)
#Extract labels
df2 <- tibble(treat=x$fnames) %>% mutate(value=1:n())
#Format to long df1
df1 <- df1 %>%
mutate(id=1:n()) %>%
pivot_longer(cols = c(group1,group2)) %>%
rename(group=name) %>% left_join(df2) %>% select(-value) %>%
pivot_wider(names_from = group,values_from=treat) %>% select(-id)
Output:
# A tibble: 3 x 6
psihat ci.lower ci.upper p.value group1 group2
<dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 -1 -3.44 1.44 0.260 placebo low
2 -2.8 -5.54 -0.0638 0.0491 placebo high
3 -1.8 -4.54 0.936 0.173 low high
I am running an experiment where participants are randomly assigned to one of two conditions, and then I collect data on several variables. Here is an example of my code:
df <- data.frame(condition =c(1,1,1,1,1,-1,-1,-1,-1,-1),
var1 = c(6,6,4,7,5,6,6,6,4,7),
var2 = c(3,4,3,6,7,1,2,1,2,5),
var3 = c(2,2,6,6,7,1,7,7,3,1),
var4 = c(6,4,3,6,4,1,3,3,4,4))
df$condition = factor(df$condition, levels = c(-1,1),labels = c("Digital","Physical"))
For each variable (var1, var2, etc.) I would like a little table with the count, mean, and standard deviation. This code creates the kind of table that I want:
group_by(df, df$condition) %>%
summarise(
count = n(),
mean = mean(var1),
sd = sd(var1))
But because I have many variables, I would like to use some kind of loop (or "lapply"?) to create all these tables at once. It would also be great if each table could show the name of the variable. Thanks!
You can just use summarise on all the variables, i.e.
library(dplyr)
group_by(df, condition) %>%
summarise(across(everything(), ~ c(count = n(), mean = mean(.), sd = sd(.))))
`summarise()` has grouped output by 'condition'. You can override using the `.groups` argument.
# A tibble: 6 x 5
# Groups: condition [2]
condition var1 var2 var3 var4
<fct> <dbl> <dbl> <dbl> <dbl>
1 Digital 5 5 5 5
2 Digital 5.8 2.2 3.8 3
3 Digital 1.10 1.64 3.03 1.22
4 Physical 5 5 5 5
5 Physical 5.6 4.6 4.6 4.6
6 Physical 1.14 1.82 2.41 1.34
You can control the output structure by changing the object that the formula returns, i.e.
group_by(df, condition) %>%
summarise(across(everything(), ~ data.frame(count = n(), mean = mean(.), sd = sd(.))))
# A tibble: 2 x 5
condition var1$count $mean $sd var2$count $mean $sd var3$count $mean $sd var4$count $mean $sd
<fct> <int> <dbl> <dbl> <int> <dbl> <dbl> <int> <dbl> <dbl> <int> <dbl> <dbl>
1 Digital 5 5.8 1.10 5 2.2 1.64 5 3.8 3.03 5 3 1.22
2 Physical 5 5.6 1.14 5 4.6 1.82 5 4.6 2.41 5 4.6 1.34
We could still do it with summarise by passing a list of functions:
library(dplyr)
df %>%
group_by(condition) %>%
summarise(across(starts_with("var"), .fns = list(n = ~n(),
mean = mean,
sd = sd), na.rm = TRUE))
condition var1_n var1_mean var1_sd var2_n var2_mean var2_sd var3_n var3_mean var3_sd var4_n var4_mean var4_sd
<dbl> <int> <dbl> <dbl> <int> <dbl> <dbl> <int> <dbl> <dbl> <int> <dbl> <dbl>
1 -1 5 5.8 1.10 5 2.2 1.64 5 3.8 3.03 5 3 1.22
2 1 5 5.6 1.14 5 4.6 1.82 5 4.6 2.41 5 4.6 1.34
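If one row per variable and condition is preferred over a wide table, reshaping to long format first is another option. This is only a sketch, not part of the answers above; the column names `variable` and `value` are my own choice.

```r
library(dplyr)
library(tidyr)

df <- data.frame(condition = c(1, 1, 1, 1, 1, -1, -1, -1, -1, -1),
                 var1 = c(6, 6, 4, 7, 5, 6, 6, 6, 4, 7),
                 var2 = c(3, 4, 3, 6, 7, 1, 2, 1, 2, 5),
                 var3 = c(2, 2, 6, 6, 7, 1, 7, 7, 3, 1),
                 var4 = c(6, 4, 3, 6, 4, 1, 3, 3, 4, 4))
df$condition <- factor(df$condition, levels = c(-1, 1),
                       labels = c("Digital", "Physical"))

# Stack var1..var4 into one column, then summarise per variable and condition
long_summary <- df %>%
  pivot_longer(starts_with("var"), names_to = "variable", values_to = "value") %>%
  group_by(variable, condition) %>%
  summarise(count = n(), mean = mean(value), sd = sd(value), .groups = "drop")

long_summary
```

Each row carries the variable name, so `split(long_summary, long_summary$variable)` would give one small table per variable if separate tables are needed.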
df <- data.frame(condition =c(1,1,1,1,1,-1,-1,-1,-1,-1),
var1 = c(6,6,4,7,5,6,6,6,4,7),
var2 = c(3,4,3,6,7,1,2,1,2,5),
var3 = c(2,2,6,6,7,1,7,7,3,1),
var4 = c(6,4,3,6,4,1,3,3,4,4))
df$condition = factor(df$condition, levels = c(-1,1),labels = c("Digital","Physical"))
library(dplyr)
for (var in names(df)[-1]) {
tab <- group_by(df, condition) %>%
select(condition, v = all_of(var)) %>%
summarise(
count = n(),
mean = mean(v),
sd = sd(v)
)
print(var)
print(tab)
}
gives
[1] "var1"
# A tibble: 2 × 4
condition count mean sd
<fct> <int> <dbl> <dbl>
1 Digital 5 5.8 1.10
2 Physical 5 5.6 1.14
[1] "var2"
# A tibble: 2 × 4
condition count mean sd
<fct> <int> <dbl> <dbl>
1 Digital 5 2.2 1.64
2 Physical 5 4.6 1.82
[1] "var3"
# A tibble: 2 × 4
condition count mean sd
<fct> <int> <dbl> <dbl>
1 Digital 5 3.8 3.03
2 Physical 5 4.6 2.41
[1] "var4"
# A tibble: 2 × 4
condition count mean sd
<fct> <int> <dbl> <dbl>
1 Digital 5 3 1.22
2 Physical 5 4.6 1.34
Rather than lapply, the function of choice is aggregate, a close relative of the *apply family. Pass it a custom function f.
f <- \(x) c(n=length(x), mu=mean(x), sd=sd(x))
aggregate(. ~ condition, df, f)
# condition var1.n var1.mu var1.sd var2.n var2.mu var2.sd var3.n var3.mu var3.sd var4.n var4.mu var4.sd
# 1 Digital 5.000000 5.800000 1.095445 5.000000 2.200000 1.643168 5.000000 3.800000 3.033150 5.000000 3.000000 1.224745
# 2 Physical 5.000000 5.600000 1.140175 5.000000 4.600000 1.816590 5.000000 4.600000 2.408319 5.000000 4.600000 1.341641
If you want to aggregate on a specific set of variables (e.g. assembled with grep), use list notation instead.
aggregate(df[grep('^var', names(df))], df['condition'], f)
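One thing to be aware of: when the function returns a vector, aggregate stores each variable's statistics as a matrix column, so the result has fewer real columns than the printout suggests. A sketch of how to check this and flatten it with do.call(data.frame, ...); the names `res` and `flat` are just for illustration.

```r
df <- data.frame(condition = c(1, 1, 1, 1, 1, -1, -1, -1, -1, -1),
                 var1 = c(6, 6, 4, 7, 5, 6, 6, 6, 4, 7),
                 var2 = c(3, 4, 3, 6, 7, 1, 2, 1, 2, 5),
                 var3 = c(2, 2, 6, 6, 7, 1, 7, 7, 3, 1),
                 var4 = c(6, 4, 3, 6, 4, 1, 3, 3, 4, 4))
df$condition <- factor(df$condition, levels = c(-1, 1),
                       labels = c("Digital", "Physical"))

f <- function(x) c(n = length(x), mu = mean(x), sd = sd(x))
res <- aggregate(. ~ condition, data = df, FUN = f)

# condition plus one matrix column per variable
ncol(res)

# Expand the matrix columns into ordinary columns (var1.n, var1.mu, ...)
flat <- do.call(data.frame, res)
names(flat)
```

The flattened version is easier to pass on to functions that expect one atomic vector per column.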
You can use gtsummary here if you need to present the results.
Example one below will make one table with all of your variables. Example two will split each variable into its own table (if you need them to be separate).
library(gtsummary)
#example one:
tbl_summary(df, by = condition,
type = list(everything()~"continuous"),
statistic = list(all_continuous()~"{mean} ({sd}) "))
#example two:
tbl_summary(df, by = condition,
type = list(everything()~"continuous"),
statistic = list(all_continuous()~"{mean} ({sd}) ")) %>%
tbl_split(variables = c(var1, var2,var3,var4))
I have a sample dataset as below:
Day<-c(1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2)
Group<-c("A","A","A","B","B","B","C","C","C","A","A","A","A","B","B","B","C","C","C")
Rain<-c(4,4,6,5,3,4,5,5,3,6,6,6,5,3,3,3,2,5,2)
UV<-c(6,6,7,8,5,6,5,6,6,6,7,7,8,8,5,6,8,5,7)
dat<-data.frame(Day,Group,Rain,UV)
I want to run a Kruskal Wallis test among 'A','B' and 'C' in "Group" for the variables "Rain" and "UV".
At present, I am subsetting the variables one by one for Kruskal test as below:
dat_Rain<-dat%>%select(c(Day,Group,Rain))
library(rstatix)
library(tidyverse)
dat_Rain%>%
group_by(Day) %>%
kruskal_test(Rain ~ Group)
How do I repeat the Kruskal-Wallis test for multiple variables (Rain, UV) in this dataset? Thanks.
You can define the columns that you want to apply kruskal_test and use map_df to get all the values in one dataframe.
library(rstatix)
library(tidyverse)
cols <- c('Rain', 'UV')
map_df(cols, ~dat %>% group_by(Day) %>% kruskal_test(reformulate('Group', .x)))
# Day .y. n statistic df p method
# <dbl> <chr> <int> <dbl> <int> <dbl> <chr>
#1 1 Rain 9 0.505 2 0.777 Kruskal-Wallis
#2 2 Rain 10 6.52 2 0.0384 Kruskal-Wallis
#3 1 UV 9 1.16 2 0.56 Kruskal-Wallis
#4 2 UV 10 0.423 2 0.809 Kruskal-Wallis
Using lapply and a helper function, this could be achieved as shown below.
I also used bind_rows to bind the resulting list into one data frame.
Day<-c(1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2)
Group<-c("A","A","A","B","B","B","C","C","C","A","A","A","A","B","B","B","C","C","C")
Rain<-c(4,4,6,5,3,4,5,5,3,6,6,6,5,3,3,3,2,5,2)
UV<-c(6,6,7,8,5,6,5,6,6,6,7,7,8,8,5,6,8,5,7)
dat<-data.frame(Day,Group,Rain,UV)
library(rstatix)
library(tidyverse)
kt <- function(x, data) {
fmla <- as.formula(paste(x, "~ Group"))
data %>%
group_by(Day) %>%
kruskal_test(fmla)
}
lapply(c("Rain", "UV"), kt, data = dat) %>%
bind_rows()
#> # A tibble: 4 x 7
#> Day .y. n statistic df p method
#> <dbl> <chr> <int> <dbl> <int> <dbl> <chr>
#> 1 1 Rain 9 0.505 2 0.777 Kruskal-Wallis
#> 2 2 Rain 10 6.52 2 0.0384 Kruskal-Wallis
#> 3 1 UV 9 1.16 2 0.56 Kruskal-Wallis
#> 4 2 UV 10 0.423 2 0.809 Kruskal-Wallis
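For comparison, here is a sketch of the same idea in base R only, using stats::kruskal.test. The `combos` data frame and its column names are my own; I have not checked the resulting p-values against the rstatix output above.

```r
Day <- c(1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2)
Group <- c("A","A","A","B","B","B","C","C","C","A","A","A","A","B","B","B","C","C","C")
Rain <- c(4,4,6,5,3,4,5,5,3,6,6,6,5,3,3,3,2,5,2)
UV <- c(6,6,7,8,5,6,5,6,6,6,7,7,8,8,5,6,8,5,7)
dat <- data.frame(Day, Group, Rain, UV)

# One row per Day x variable combination
combos <- expand.grid(Day = unique(dat$Day),
                      variable = c("Rain", "UV"),
                      stringsAsFactors = FALSE)

# Build the formula from the variable name and test within each Day
combos$p <- mapply(function(day, v) {
  kruskal.test(reformulate("Group", response = v),
               data = dat[dat$Day == day, ])$p.value
}, combos$Day, combos$variable)

combos
```

Adding another variable only requires extending the vector passed as `variable`.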
I wish to estimate models in one dataframe, but the formula for each model has some "moving parts" which come from another dataframe. For example, say I wish to estimate the following model (I can't post picture and found no way to type latex equations):
mpg = a + b*log(w_1 * drat + w_2 * hp)
where w_1 and w_2 are weights, which for example are either 0.5 or 1. I use expand.grid() to create a dataframe of weights, then mutate() a formula using paste() or paste0() with the variable names and the value of the weights, and then pass it to the lm() function.
However, the model estimated is just using the formula found in the first row of the weights dataframe. This is solved if I use group_by() before estimating the models.
The question is: why? Why doesn't the first code work? What does group_by() achieve here that makes it possible?
library(tidyverse)
cars <- mtcars
w <- seq(from=0.5, to=1, by=0.5)
weights <- as_tibble(expand.grid(w1=w,w2=w))
#Doesn't work - the lm model is fit using the formula from the first row only
weights %>%
mutate(formula_weights = paste0("mpg~log(",w1,"*drat+",w2,"*hp)")) %>%
mutate(r2 = summary(lm(data=cars, formula = formula_weights))$r.squared)
#Does work - model is fit using the w1 and w2 values from each row (formula_weights)
weights %>%
mutate(formula_weights = paste0("mpg~log(",w1,"*drat+",w2,"*hp)")) %>%
group_by(formula_weights) %>%
mutate(r2 = summary(lm(data=cars, formula = formula_weights))$r.squared)
The output without group_by():
# A tibble: 4 x 4
w1 w2 formula_weights r2
<dbl> <dbl> <chr> <dbl>
1 0.5 0.5 mpg~log(0.5*drat+0.5*hp) 0.715
2 1 0.5 mpg~log(1*drat+0.5*hp) 0.715
3 0.5 1 mpg~log(0.5*drat+1*hp) 0.715
4 1 1 mpg~log(1*drat+1*hp) 0.715
The output with group_by():
# A tibble: 4 x 4
# Groups: formula_weights [4]
w1 w2 formula_weights r2
<dbl> <dbl> <chr> <dbl>
1 0.5 0.5 mpg~log(0.5*drat+0.5*hp) 0.715
2 1 0.5 mpg~log(1*drat+0.5*hp) 0.709
3 0.5 1 mpg~log(0.5*drat+1*hp) 0.718
4 1 1 mpg~log(1*drat+1*hp) 0.715
We can add rowwise
library(dplyr)
weights %>%
mutate(formula_weights = paste0("mpg~log(",w1,"*drat+",w2,"*hp)")) %>%
rowwise() %>%
mutate(r2 = summary(lm(data=cars, formula = formula_weights))$r.squared)
#Source: local data frame [4 x 4]
#Groups: <by row>
# A tibble: 4 x 4
# w1 w2 formula_weights r2
# <dbl> <dbl> <chr> <dbl>
#1 0.5 0.5 mpg~log(0.5*drat+0.5*hp) 0.715
#2 1 0.5 mpg~log(1*drat+0.5*hp) 0.709
#3 0.5 1 mpg~log(0.5*drat+1*hp) 0.718
#4 1 1 mpg~log(1*drat+1*hp) 0.715
Or use map
library(purrr)
weights %>%
mutate(r2 = map_dbl(paste0("mpg~log(",w1,"*drat+",w2,"*hp)"), ~
summary(lm(data = cars, formula = .x))$r.squared))
# A tibble: 4 x 3
# w1 w2 r2
# <dbl> <dbl> <dbl>
#1 0.5 0.5 0.715
#2 1 0.5 0.709
#3 0.5 1 0.718
#4 1 1 0.715
Use sapply inside your mutate; summary and lm are not vectorized.
weights %>%
mutate(formula_weights = paste0("mpg~log(",w1,"*drat+",w2,"*hp)")) %>%
mutate(r2 = sapply(formula_weights,
function(fw) summary(lm(data=cars, formula = fw))$r.squared))
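As for the "why": mutate evaluates its expression once per group, and inside the expression each column is the vector of that group's values. A small sketch that makes this visible by measuring the length of `formula_weights` instead of fitting a model (the `len` column is only for illustration):

```r
library(dplyr)

w <- seq(from = 0.5, to = 1, by = 0.5)
weights <- as_tibble(expand.grid(w1 = w, w2 = w)) %>%
  mutate(formula_weights = paste0("mpg~log(", w1, "*drat+", w2, "*hp)"))

# Ungrouped: the expression sees the whole column, i.e. all four strings
ungrouped <- weights %>%
  mutate(len = length(formula_weights))

# Grouped by the (unique) formula string: the expression sees one string per group
grouped <- weights %>%
  group_by(formula_weights) %>%
  mutate(len = length(formula_weights)) %>%
  ungroup()

ungrouped$len
grouped$len
```

So without grouping, lm is called a single time with all four strings at once, and judging by the output in the question only the first one ends up being used; the single r.squared is then recycled to every row. With group_by (or rowwise, map, sapply) lm is called once per formula.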
I have this simple dataframe. The sum column represents the sum of the row. I would like to use prop.test to determine the P-value for each column, and present that data as an additional row labeled p-value. I can use prop.test in the following way to determine a p value for any individual column, but cannot work out how to apply that to multiple columns with a single function.
Other Island N_Shelf N_Shore S_Shore Sum
Type1 10 4 1 0 3 18
Type2 19 45 1 9 11 85
This will output a p-value for the island column
ResI2<- prop.test(x=TableAvE_Island$Island, n=TableAvE_Island$Sum)
output:
data: TableAvE_Island$Island out of TableAvE_Island$Sum
X-squared = 4.456, df = 1, p-value = 0.03478
alternative hypothesis: two.sided
95 percent confidence interval:
-0.56027107 -0.05410802
sample estimates:
prop 1 prop 2
0.2222222 0.5294118
I've tried to use the apply command but cannot work out its usage, and the examples I've been able to find don't seem similar enough. Any pointers would be appreciated.
Here's a look with broom's function tidy, which takes output from tests and other operations and formats them as "tidy" data frames.
For the first prop.test that you posted, the tidy output looks like this:
library(tidyverse)
broom::tidy(prop.test(TableAvE_Island$Island, TableAvE_Island$Sum))
#> estimate1 estimate2 statistic p.value parameter conf.low
#> 1 0.2222222 0.5294118 4.456017 0.03477849 1 -0.5602711
#> conf.high
#> 1 -0.05410802
#> method
#> 1 2-sample test for equality of proportions with continuity correction
#> alternative
#> 1 two.sided
To do this for all the variables in your data frame vs Sum, I gathered it into a long shape:
table_long <- gather(TableAvE_Island, key = variable, value = val, -Sum)
head(table_long)
#> # A tibble: 6 x 3
#> Sum variable val
#> <int> <chr> <int>
#> 1 18 Other 10
#> 2 85 Other 19
#> 3 18 Island 4
#> 4 85 Island 45
#> 5 18 N_Shelf 1
#> 6 85 N_Shelf 1
Then I grouped the long-shaped data by variable and piped it into do, which lets you call a function on each group in a data frame, using . as a stand-in for that group's subset of the data. Then I called tidy on the column containing the nested results of prop.test. This gives you a data frame of all the relevant test results, with one row for each of "Island", "N_Shelf", etc.
table_long %>%
group_by(variable) %>%
do(test = prop.test(x = .$val, n = .$Sum)) %>%
broom::tidy(test)
#> # A tibble: 5 x 10
#> # Groups: variable [5]
#> variable estimate1 estimate2 statistic p.value parameter conf.low
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Island 0.222 0.529 4.46 0.0348 1 -0.560
#> 2 N_Shelf 0.0556 0.0118 0.0801 0.777 1 -0.0981
#> 3 N_Shore 0 0.106 0.972 0.324 1 -0.205
#> 4 Other 0.556 0.224 6.54 0.0106 1 0.0523
#> 5 S_Shore 0.167 0.129 0.00163 0.968 1 -0.183
#> # ... with 3 more variables: conf.high <dbl>, method <fct>,
#> # alternative <fct>
Created on 2018-05-10 by the reprex package (v0.2.0).
We could gather into 'long' format and then store it as a list column
library(tidyverse)
res <- gather(TableAvE_Island, key, val, -Sum) %>%
group_by(key) %>%
nest() %>%
mutate(out = map(data, ~prop.test(.x$val, .x$Sum)))
res$out
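To get the "p-value" row asked for in the question, a base R sketch is below. The data frame `tab` is rebuilt from the table printed in the question and stands in for `TableAvE_Island`; warnings are suppressed because prop.test warns when counts are small.

```r
# Rebuilt from the table in the question; stands in for TableAvE_Island
tab <- data.frame(Other = c(10, 19), Island = c(4, 45), N_Shelf = c(1, 1),
                  N_Shore = c(0, 9), S_Shore = c(3, 11), Sum = c(18, 85),
                  row.names = c("Type1", "Type2"))

# One prop.test per column (everything except Sum), keeping only the p-value
count_cols <- setdiff(names(tab), "Sum")
p_values <- suppressWarnings(
  sapply(tab[count_cols], function(x) prop.test(x = x, n = tab$Sum)$p.value)
)

# Append as an extra row; Sum has no p-value, so it gets NA
out <- rbind(tab, unname(c(p_values, NA)))
rownames(out)[nrow(out)] <- "p-value"
out
```

Mixing counts and p-values in one data frame is convenient for display, but for further analysis it is probably cleaner to keep `p_values` as its own vector.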
I am interested in the total mean, and the mean within different conditions, of some measurements preferably using dplyr's summarise function.
I'll illustrate my question in the following. Say I have some data, borrowed from another example,
dta <- read.table(header=TRUE, text='
subject sex condition measurement
1 M control 7.9
1 M cond1 12.3
1 M cond2 10.7
2 F control 6.3
2 F cond1 10.6
2 F cond2 11.1
3 F control 9.5
3 F cond1 13.1
3 F cond2 13.8
4 M control 11.5
4 M cond1 13.4
4 M cond2 12.9
') # ; dta
I now want the mean for each sex and the mean by sex for each condition. I know how to get it for each condition, like this.
# install.packages(c("dplyr"), dependencies = TRUE)
library(dplyr)
dta %>%
group_by(sex, condition) %>%
summarise(
mean = mean(measurement)
)
#> # A tibble: 6 x 3
#> # Groups: sex [?]
#> sex condition mean
#> <fctr> <fctr> <dbl>
#> 1 F cond1 11.85
#> 2 F cond2 12.45
#> 3 F control 7.90
#> 4 M cond1 12.85
#> 5 M cond2 11.80
#> 6 M control 9.70
But, this does not give me the aggregate mean for both sexes. To get this I either have to run a separate call, i.e.
dta %>%
group_by(sex) %>%
summarise(
mean = mean(measurement)
)
#> # A tibble: 2 x 2
#> sex mean
#> <fctr> <dbl>
#> 1 F 10.73333
#> 2 M 11.45000
or restructure the data, like this:
# install.packages(c("tidyr"), dependencies = TRUE)
library(tidyr)
dta_wide <- spread(dta, condition, measurement)
dta_wide %>%
group_by(sex) %>%
summarise(
mean_tot = mean(cond1 + cond2 + control)/3,
mean_cond1 = mean(cond1),
mean_cond2 = mean(cond2),
mean_control = mean(control)
)
#> # A tibble: 2 x 5
#> sex mean_tot mean_cond1 mean_cond2 mean_control
#> <fctr> <dbl> <dbl> <dbl> <dbl>
#> 1 F 10.73333 11.85 12.45 7.9
#> 2 M 11.45000 12.85 11.80 9.7
This gives me an output with both the overall mean by sex and the individual mean by condition.
However, both running two separate calls and restructuring the data seem unnecessarily cumbersome. Isn't there a simple way to add a categorical variable, here condition, as the by variable and at the same time keep the aggregate information, here mean by sex? Maybe I am overlooking something logical and shouldn't be messing with data like this?
One option is to calculate the two summaries separately, then join back:
dta %>%
group_by(sex, condition) %>%
summarise(mean = mean(measurement)) %>%
inner_join(
group_by(dta, sex) %>%
summarise(mean_tot = mean(measurement))
)
# Joining, by = "sex"
# A tibble: 6 x 4
# Groups: sex [?]
# sex condition mean mean_tot
# <fctr> <fctr> <dbl> <dbl>
#1 F cond1 11.85 10.73333
#2 F cond2 12.45 10.73333
#3 F control 7.90 10.73333
#4 M cond1 12.85 11.45000
#5 M cond2 11.80 11.45000
#6 M control 9.70 11.45000
Or use group_by twice:
dta %>%
group_by(sex, condition) %>%
summarise(s = sum(measurement), n = n()) %>%
group_by(sex) %>%
transmute(condition, mean_tot = sum(s) / sum(n), mean = s / n)
# Adding missing grouping variables: `sex`
# A tibble: 6 x 4
# Groups: sex [2]
# sex condition mean_tot mean
# <fctr> <fctr> <dbl> <dbl>
#1 F cond1 10.73333 11.85
#2 F cond2 10.73333 12.45
#3 F control 10.73333 7.90
#4 M cond1 11.45000 12.85
#5 M cond2 11.45000 11.80
#6 M control 11.45000 9.70
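A third variant, sketched here as an assumption rather than taken from the answer above, is to attach the per-sex mean with mutate first and then carry it along as a grouping variable. The data are retyped from the question.

```r
library(dplyr)

dta <- data.frame(
  subject = rep(1:4, each = 3),
  sex = rep(c("M", "F", "F", "M"), each = 3),
  condition = rep(c("control", "cond1", "cond2"), times = 4),
  measurement = c(7.9, 12.3, 10.7, 6.3, 10.6, 11.1,
                  9.5, 13.1, 13.8, 11.5, 13.4, 12.9)
)

out <- dta %>%
  group_by(sex) %>%
  mutate(mean_tot = mean(measurement)) %>%   # overall mean per sex, repeated on each row
  group_by(sex, condition, mean_tot) %>%     # mean_tot is constant within sex, so groups are unchanged
  summarise(mean = mean(measurement), .groups = "drop")

out
```

This avoids the join, at the cost of using `mean_tot` as a grouping column purely to keep it in the summarised output.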