Stack (rbind) two different tables or tibbles together - r

I want to stack two different tables or tibbles in R. If I use rbind() or bind_rows(), I get a table, but it is not what I want. The two tables don't share any ID or variables. For example,
xx <- mtcars %>%
group_by(vs) %>%
summarize(mean(mpg), sd(mpg))
yy <- mtcars %>%
group_by(am) %>%
summarise(mean(wt), sd(wt))
I want to have this outcome:
am mean(wt) sd(wt)
0 3.77 0.777
1 2.41 0.617
vs mean(mpg) sd(mpg)
0 16.6 3.86
1 24.6 5.38
I have tried several ways to do this but haven't had any luck because of my limited R skills. I would appreciate any help with this problem. Thank you.

Try formatting the data with common column names:
library(purrr) # provides set_names()
#Format data
nxx <- rbind(colnames(xx),xx)
nyy <- rbind(colnames(yy),yy)
#Assign common names
nxx <- set_names(nxx,paste0('V',1:dim(nxx)[2]))
nyy <- set_names(nyy,paste0('V',1:dim(nyy)[2]))
#Bind
ndf <- rbind(nxx,nyy)
Output:
# A tibble: 6 x 3
V1 V2 V3
<chr> <chr> <chr>
1 vs mean(mpg) sd(mpg)
2 0 16.6166666666667 3.86069941849919
3 1 24.5571428571429 5.37897821090647
4 am mean(wt) sd(wt)
5 0 3.76889473684211 0.777400146838225
6 1 2.411 0.616981631277085

We can use tidyverse methods. Convert the columns to character in each dataset, add the column names as the first row (add_row), and then use bind_rows to bind the datasets:
library(dplyr)
library(tibble)
library(stringr)
xx %>%
mutate(across(everything(), ~ as.character(round(., 2)))) %>%
add_row(!!! setNames(names(.), names(.)), .before = 1) %>%
rename_all(~ str_c('v', seq_along(.))) %>%
bind_rows(yy %>%
mutate(across(everything(), ~ as.character(round(., 2)))) %>%
add_row(!!! setNames(names(.), names(.)), .before = 1) %>%
rename_all(~ str_c('v', seq_along(.))))
# A tibble: 6 x 3
# v1 v2 v3
# <chr> <chr> <chr>
#1 vs mean(mpg) sd(mpg)
#2 0 16.62 3.86
#3 1 24.56 5.38
#4 am mean(wt) sd(wt)
#5 0 3.77 0.78
#6 1 2.41 0.62
Or this can be made slightly more compact
library(zeallot)
library(stringr)
library(purrr)
list(xx, yy) %>%
map_dfr( ~ destructure(.x) %>%
imap_dfr(~ c(.y, round(.x, 2))) %>%
rename_with(~ str_c('v', seq_along(.))))
# A tibble: 6 x 3
# v1 v2 v3
# <chr> <chr> <chr>
#1 vs mean(mpg) sd(mpg)
#2 0 16.62 3.86
#3 1 24.56 5.38
#4 am mean(wt) sd(wt)
#5 0 3.77 0.78
#6 1 2.41 0.62
Or using base R
do.call(rbind, lapply(list(xx, yy), function(x)
unname(rbind(colnames(x), round(x, 2)))))
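The same idea can be wrapped in a small reusable helper. The following is only a sketch in base R: the function name as_header_rows and its digits argument are made up for illustration, and the summaries are built with aggregate() (one mean per group) so that the example runs without any packages.

```r
# Hypothetical helper (not from any package): convert a summary table to
# character columns and put its own column names in the first row.
as_header_rows <- function(tbl, digits = 2) {
  tbl <- as.data.frame(tbl)
  # Round each column and convert it to character
  body <- lapply(tbl, function(col) as.character(round(col, digits)))
  # Prepend each column's name to its values
  cols <- Map(function(nm, col) c(nm, col), names(tbl), body)
  # Give every table the same generic column names so rbind() can stack them
  names(cols) <- paste0("V", seq_along(cols))
  as.data.frame(cols, stringsAsFactors = FALSE)
}

# Two summaries with no shared variables (simplified to one statistic each)
xx <- aggregate(mpg ~ vs, data = mtcars, FUN = mean)
yy <- aggregate(wt ~ am, data = mtcars, FUN = mean)

stacked <- do.call(rbind, lapply(list(xx, yy), as_header_rows))
print(stacked)
```

Because every table is renamed to V1, V2, ..., the helper only works for tables with the same number of columns.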

Related

Perform an operation within grouped pairs in tidyverse

I have a dataset with three columns that are grouped by two variables.
df <- tibble(paper = rep(c("A_2012", "B_2019"), each = 5),
question = rep(c(1,2,3,4,5), 2),
rate = c(4.545455, 4.010000, 4.672727, 4.100000, 3.418182, 3.060000,
4.563636, 3.760000, 4.636364, 4.000000))
> df %>% group_by(question) %>%
select(question, paper, rate) %>%
arrange(question)
# A tibble: 10 x 3
# Groups: question [5]
question paper rate
<dbl> <chr> <dbl>
1 1 A_2012 4.55
2 1 B_2019 3.06
3 2 A_2012 4.01
4 2 B_2019 4.56
5 3 A_2012 4.67
6 3 B_2019 3.76
7 4 A_2012 4.1
8 4 B_2019 4.64
9 5 A_2012 3.42
10 5 B_2019 4
I need to perform an operation on the 'rate' values within each group, but I do not know how to write the code in tidyverse style. In this example, I want the difference (paper B - paper A) for each question:
> df_result
# A tibble: 5 x 2
question diff_rate
<dbl> <dbl>
1 1 -1.49
2 2 0.55
3 3 -0.91
4 4 0.54
5 5 0.58
I've tried using pivot_wider and then some operations, but I actually have 54 different values for the variable paper, so it is not efficient.
Any help is truly appreciated.
You can do
df %>% group_by(question) %>%
select(question, paper, rate) %>%
arrange(question) %>% mutate(
diff_rate=diff(rate)
)
If you want the same format as your df_result, you can do
df %>% group_by(question) %>%
select(question, paper, rate) %>%
arrange(question) %>% mutate(
diff_rate=diff(rate)
) %>% select(question, diff_rate) %>% distinct()
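Note that diff(rate) depends on the rows of each question being ordered A before B. A possible alternative, sketched here under the assumption that each question has exactly one row per paper, is to pick the two values by name inside summarise(). The paper labels "A_2012" and "B_2019" come from the example data; the .groups argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

df <- tibble(paper = rep(c("A_2012", "B_2019"), each = 5),
             question = rep(c(1, 2, 3, 4, 5), 2),
             rate = c(4.545455, 4.010000, 4.672727, 4.100000, 3.418182,
                      3.060000, 4.563636, 3.760000, 4.636364, 4.000000))

# Subset 'rate' by paper inside each question group, so the result does not
# depend on row order within the group.
df_result <- df %>%
  group_by(question) %>%
  summarise(diff_rate = rate[paper == "B_2019"] - rate[paper == "A_2012"],
            .groups = "drop")

print(df_result)
```

With 54 papers, the two labels being compared would have to be supplied for each comparison of interest; this sketch only covers a single pair.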

Nested group_by operation in dplyr: does the second call include the first call?

In my data below, I first want to group_by(study), get the mean of X for each unique study value, and subtract it from each X value in each study.
Second, while group_by(study) is still in effect, I want to further group_by(outcome) within each study, get the mean of X for each unique outcome value within each unique study value, and subtract it from each X value in each outcome in each study.
I'm using the following workaround, but it doesn't seem to achieve my goal, because the group_by(outcome) call appears to ignore the previous group_by(study).
Is there a way to achieve what I described above?
library(dplyr)
set.seed(0)
(data <- expand.grid(study = 1:2, outcome = rep(1:2,2)))
data$X <- rnorm(nrow(data))
(data <- arrange(data,study))
# study outcome X
#1 1 1 1.2629543
#2 1 2 1.3297993
#3 1 1 0.4146414
#4 1 2 -0.9285670
#5 2 1 -0.3262334
#6 2 2 1.2724293
#7 2 1 -1.5399500
#8 2 2 -0.2947204
data %>%
group_by(study) %>%
mutate(X_between_st = mean(X), X_within_st = X-X_between_st) %>%
group_by(outcome) %>%
mutate(X_between_ou = mean(X), X_within_ou = X-X_between_ou)
Yes, the second group_by overwrites the previous group_by, which can be checked with the group_vars function.
library(dplyr)
data %>%
group_by(study) %>%
mutate(X_between_st = mean(X), X_within_st = X-X_between_st) %>%
group_by(outcome) %>%
group_vars()
#[1] "outcome"
As you can see at this stage the data is grouped only by outcome.
You can achieve your goal by including .add = TRUE in group_by which will add to the existing groups.
data %>%
group_by(study) %>%
mutate(X_between_st = mean(X), X_within_st = X-X_between_st) %>%
group_by(outcome, .add = TRUE) %>%
group_vars()
#[1] "study" "outcome"
So ultimately, the code becomes:
data %>%
group_by(study) %>%
mutate(X_between_st = mean(X), X_within_st = X-X_between_st) %>%
group_by(outcome, .add = TRUE) %>%
mutate(X_between_ou = mean(X), X_within_ou = X-X_between_ou)
# study outcome X X_between_st X_within_st X_between_ou X_within_ou
# <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 1 1.26 0.520 0.743 0.839 0.424
#2 1 2 1.33 0.520 0.810 0.201 1.13
#3 1 1 0.415 0.520 -0.105 0.839 -0.424
#4 1 2 -0.929 0.520 -1.45 0.201 -1.13
#5 2 1 -0.326 -0.222 -0.104 -0.933 0.607
#6 2 2 1.27 -0.222 1.49 0.489 0.784
#7 2 1 -1.54 -0.222 -1.32 -0.933 -0.607
#8 2 2 -0.295 -0.222 -0.0726 0.489 -0.784
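As a quick check of the behaviour described above, the sketch below compares the step-by-step grouping (group_by(study) followed by group_by(outcome, .add = TRUE)) with a single group_by(study, outcome) call. It assumes dplyr 1.0.0 or later, where the .add argument is available; the object names stepwise and direct are only illustrative.

```r
library(dplyr)

set.seed(0)
data <- expand.grid(study = 1:2, outcome = rep(1:2, 2))
data$X <- rnorm(nrow(data))

# Grouping added in two steps
stepwise <- data %>%
  group_by(study) %>%
  group_by(outcome, .add = TRUE)

# Grouping specified in one call
direct <- data %>%
  group_by(study, outcome)

group_vars(stepwise)
group_vars(direct)

# Centering X within study-by-outcome cells gives the same values either way
within_step   <- stepwise %>% mutate(X_within_ou = X - mean(X)) %>% ungroup()
within_direct <- direct   %>% mutate(X_within_ou = X - mean(X)) %>% ungroup()
all.equal(within_step$X_within_ou, within_direct$X_within_ou)
```

The two-step form is mainly useful when something has to be computed at the study level before the finer grouping is applied, as in the answer above.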
We may use cur_group
data %>%
group_by(study) %>%
summarise(grps = names(cur_group())) %>%
slice(1) %>%
pull(grps)
[1] "study"

Is there a way to do a conditional and multiple row by row operation along a sorted and grouped tibble?

I have a grouped tibble where several parameters have to be calculated from others assuming a function that gets its values from a previous row. I have tried to find answers that involve lag, mutate, case_when, and aggregate but had no luck implementing these in the following toy dataset:
library(tidyverse)
set.seed(42)
df <- tibble(
gr = c(1,1,1,2,2,2),
t = rep((seq(1:3)),2),
v1 = c(1,NA,NA,1.6,NA,NA),
v2 = rnorm(6),
v3 = c(-0.2,0.3,-0.6,-0.2,1,0.2)
)
# These operations
(df <- df %>% group_by(gr) %>% arrange(t, .by_group = TRUE) %>%
mutate(R1=abs(v1-5*v2)) %>%
mutate(R2=abs(R1*v2)^(1/2)) %>% mutate(RI3=R1/R2))
# A tibble: 6 x 8
# Groups: gr [2]
gr t v1 v2 v3 R1 R2 RI3
<dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 -1.39 -0.2 7.94 3.32 2.39
2 1 2 NA -0.279 0.3 NA NA NA
3 1 3 NA -0.133 -0.6 NA NA NA
4 2 1 1.6 0.636 -0.2 1.58 1.00 1.58
5 2 2 NA -0.284 1 NA NA NA
6 2 3 NA -2.66 0.2 NA NA NA
Now, what I would need to do is to
use df$RI3[i-1] as input for df$v1[i]
if is.na(df$v1[i]) is TRUE and subsequently calculate:
mutate(R1=abs(v1-5*v2)) %>% mutate(R2=abs(R1*v2)^(1/2)) %>% mutate(RI3=R1/R2)
row-by-row in order to fill the gaps within the sorted and grouped dataset;
doing this one by one would look like this:
Rdf <- df
Rdf$v1[2] <- df$RI3[1]
Rdf$v1[5] <- df$RI3[4]
Rdf <- Rdf %>% mutate(R1=abs(v1-5*v2)) %>%
mutate(R2=abs(R1*v2)^(1/2)) %>% mutate(RI3=R1/R2)
Rdf
Rdf$v1[3] <- Rdf$RI3[2]
Rdf$v1[6] <- Rdf$RI3[5]
Rdf <- Rdf %>% mutate(R1=abs(v1-5*v2)) %>%
mutate(R2=abs(R1*v2)^(1/2)) %>% mutate(RI3=R1/R2)
Rdf
and would result in:
# A tibble: 6 x 8
# Groups: gr [2]
gr t v1 v2 v3 R1 R2 RI3
<dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 -1.39 -0.2 7.94 3.32 2.39
2 1 2 2.39 -0.279 0.3 3.79 1.03 3.68
3 1 3 3.68 -0.133 -0.6 4.35 0.762 5.71
4 2 1 1.6 0.636 -0.2 1.58 1.00 1.58
5 2 2 1.58 -0.284 1 3.00 0.923 3.25
6 2 3 3.25 -2.66 0.2 16.5 6.63 2.49
I guess a for-loop within an if-condition applied to a nested df would work.
Any advice on implementing this would be great!
I implemented a for loop, but I am not sure I start off with the same df given the seed. I hope it does what you need.
When I need to write for loops that seem complicated, I use browser() to build them.
library(tibble)
library(dplyr)
set.seed(42)
df <- tibble(
gr = c(1,1,1,2,2,2),
t = rep((seq(1:3)),2),
v1 = c(1,NA,NA,1.6,NA,NA),
v2 = rnorm(6),
v3 = c(-0.2,0.3,-0.6,-0.2,1,0.2)
)
# Data prep
df <- df %>%
group_by(gr) %>%
arrange(t, .by_group = TRUE) %>%
mutate(R1=abs(v1-5*v2)) %>%
mutate(R2=abs(R1*v2)^(1/2)) %>% #
mutate(RI3=R1/R2) %>%
ungroup()
#going through df row by row
#(assumes the first row of each group has a non-NA v1, as in this data;
# otherwise the lag would pull RI3 from the previous group)
for (i in seq_len(nrow(df))) {
#browser()
# run into problems with i == 1 for the lagged operation, hence made two cases
if (i == 1) {
df$v1[i] <- if_else(is.na(df$v1[i]), df$RI3[i], df$v1[i])
} else {
df$v1[i] <- if_else(is.na(df$v1[i]), df$RI3[i-1], df$v1[i])
}
# rowwise calculation
df$R1[i] <- abs(df$v1[i]-5*df$v2[i])
df$R2[i] <- abs(df$R1[i]*df$v2[i])^(1/2)
df$RI3[i] <- df$R1[i]/df$R2[i]
}
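The loop above runs over the whole ungrouped data frame. A group-aware variant could look like the sketch below, which applies the same row-by-row recursion separately within each gr using dplyr's group_modify(). The helper name fill_group is hypothetical, and v2 is hard-coded to the rounded values printed in the question instead of rnorm(), so the numbers will only approximately match the question's table.

```r
library(dplyr)

# Hypothetical helper: run the recursion within a single group, top to bottom.
fill_group <- function(g) {
  for (i in seq_len(nrow(g))) {
    # Fill a missing v1 from the previous row's RI3 (never across groups)
    if (i > 1 && is.na(g$v1[i])) g$v1[i] <- g$RI3[i - 1]
    g$R1[i]  <- abs(g$v1[i] - 5 * g$v2[i])
    g$R2[i]  <- abs(g$R1[i] * g$v2[i])^(1/2)
    g$RI3[i] <- g$R1[i] / g$R2[i]
  }
  g
}

df <- tibble(
  gr = c(1, 1, 1, 2, 2, 2),
  t  = rep(1:3, 2),
  v1 = c(1, NA, NA, 1.6, NA, NA),
  v2 = c(-1.39, -0.279, -0.133, 0.636, -0.284, -2.66)  # rounded values from the question
)

res <- df %>%
  arrange(gr, t) %>%
  mutate(R1 = NA_real_, R2 = NA_real_, RI3 = NA_real_) %>%
  group_by(gr) %>%
  group_modify(~ fill_group(.x)) %>%
  ungroup()

print(res)
```

If the first row of a group has a missing v1, this version leaves NA values in that group instead of borrowing RI3 from the previous group.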

t-test of one group versus many groups in tidyverse

I have the following tibble
test_tbl <- tibble(name = rep(c("John", "Allan", "George", "Peter", "Paul"), each = 12),
category = rep(rep(LETTERS[1:4], each = 3), 5),
replicate = rep(1:3, 20),
value = sample.int(n = 1e5, size = 60, replace = TRUE))
# A tibble: 60 x 4
name category replicate value
<chr> <chr> <int> <int>
1 John A 1 71257
2 John A 2 98887
3 John A 3 87354
4 John B 1 25352
5 John B 2 69913
6 John B 3 43086
7 John C 1 24957
8 John C 2 33928
9 John C 3 79854
10 John D 1 32842
11 John D 2 19156
12 John D 3 50283
13 Allan A 1 98188
14 Allan A 2 26208
15 Allan A 3 69329
16 Allan B 1 32696
17 Allan B 2 81240
18 Allan B 3 54689
19 Allan C 1 77044
20 Allan C 2 97776
# … with 40 more rows
I want to group_by(name, category) and perform 3 t.test calls, comparing category B, C and D with category A.
I would like to store the estimate and p.value from the output. The expected result is something like this:
# A tibble: 5 x 7
name B_vs_A_estimate B_vs_A_p_value C_vs_A_estimate C_vs_A_p_value D_vs_A_estimate D_vs_A_p_value
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 John -0.578 0.486 0.198 0.309 0.631 0.171
2 Allan 0.140 0.644 0.728 0.283 0.980 0.485
3 George -0.778 0.320 -0.424 0.391 -0.154 0.589
4 Peter -0.435 0.470 -0.156 0.722 0.315 0.0140
5 Paul 0.590 0.0150 -0.473 0.475 0.681 0.407
I would prefer a solution using tidyverse and/or broom.
There are many ways to achieve the desired output, but this one is perhaps the most intuitive and the easiest to debug (you can put a browser() anywhere):
test_tbl %>%
group_by(name) %>%
do({
sub_tbl <- .
expand.grid(g1="A", g2=c("B", "C", "D"), stringsAsFactors = FALSE) %>%
mutate(test=as.character(glue::glue("{g1}_vs_{g2}"))) %>%
rowwise() %>%
do({
gs <- .
t_res <- t.test(sub_tbl %>% filter(category == gs$g1) %>% pull(value),
sub_tbl %>% filter(category == gs$g2) %>% pull(value))
# note: the column is named "estimate" but holds the t statistic
data.frame(test=gs$test, estimate=t_res$statistic, p_value=t_res$p.value,
stringsAsFactors = FALSE)
})
}) %>%
ungroup() %>%
gather(key="statistic", value="val", -name, -test) %>%
mutate(test_statistic = paste(test, statistic, sep = "_")) %>%
select(-test, -statistic) %>%
spread(key="test_statistic", value="val")
Result
# A tibble: 5 x 7
name A_vs_B_estimate A_vs_B_p_value A_vs_C_estimate A_vs_C_p_value A_vs_D_estimate A_vs_D_p_value
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Allan -0.270 0.803 -1.03 0.396 1.55 0.250
2 George 0.201 0.855 0.221 0.838 1.07 0.380
3 John -1.59 0.249 0.0218 0.984 -0.410 0.704
4 Paul 0.116 0.918 -1.62 0.215 -1.53 0.212
5 Peter 0.471 0.664 0.551 0.611 0.466 0.680
It groups the records by name and then applies a function (do #1). That function saves the sub-dataframe in sub_tbl, expands all the test cases (expand.grid), and creates a test name from the two letters combined. Then, for each combination, it applies a second function to run the t-tests (do #2). That anonymous function performs the test between group 1 (g1) and group 2 (g2) and returns a dataframe with the results.
The second part basically rearranges the columns to have the final output.
test_tbl %>%
dplyr::group_by(name) %>%
dplyr::summarise(estimate_AB =
t.test(value[category == "A"| category == "B"] ~ category[category == "A" | category == "B"]) %>% (function(x){x$estimate[1] - x$estimate[2]}),
pvalue_AB = t.test(value[category == "A"| category == "B"] ~ category[category == "A" | category == "B"]) %>% (function(x){x$p.value})
)
Here is what I did for testing the A against B by group. I think that you could extend my approach, or try to incorporate the code from the first solution.
EDIT: cleaner code
map(unique(test_tbl$name),function(nm){test_tbl %>% filter(name == nm)}) %>%
map2(unique(test_tbl$name),function(dat,nm){
map(LETTERS[2:4],function(cat){
dat %>%
filter(category == "A") %>%
pull %>%
t.test(dat %>% filter(category == cat) %>% pull)
}) %>%
map_dfr(broom::glance) %>%
select(statistic,p.value) %>%
mutate(
name = nm,
cross_cat = paste0(LETTERS[2:4]," versus A")
)
}) %>%
{do.call(rbind,.)}
We can use
library(dplyr)
library(purrr)
library(stringr)
library(tidyr)
test_tbl %>%
split(.$name) %>%
map_dfr(~ {
Avalue <- .x$value[.x$category == 'A']
.x %>%
filter(category != 'A') %>%
group_by(category) %>%
summarise(out = t.test(value, Avalue)$p.value) %>%
mutate(category = str_c(category, '_vs_A_p_value'))}, .id = 'name') %>%
pivot_wider(names_from = category, values_from = out)
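The answer above returns only p-values. A sketch that returns both columns requested in the question might look like this. Two things are assumptions here: "estimate" is taken to mean the difference in group means (category mean minus the mean of A), and set.seed(1) is added only to make the example reproducible. It relies on group_modify() from dplyr and the names_glue argument of pivot_wider() from tidyr 1.0.0 or later.

```r
library(dplyr)
library(tidyr)

set.seed(1)  # assumption: any seed works, this only makes the run reproducible
test_tbl <- tibble(name = rep(c("John", "Allan", "George", "Peter", "Paul"), each = 12),
                   category = rep(rep(LETTERS[1:4], each = 3), 5),
                   replicate = rep(1:3, 20),
                   value = sample.int(n = 1e5, size = 60, replace = TRUE))

res <- test_tbl %>%
  group_by(name) %>%
  group_modify(~ {
    # Reference values for category A within this name
    a <- .x$value[.x$category == "A"]
    .x %>%
      filter(category != "A") %>%
      group_by(category) %>%
      summarise(estimate = mean(value) - mean(a),       # difference in means (assumed definition)
                p_value  = t.test(value, a)$p.value,    # Welch two-sample t-test
                .groups = "drop")
  }) %>%
  ungroup() %>%
  pivot_wider(names_from = category,
              values_from = c(estimate, p_value),
              names_glue = "{category}_vs_A_{.value}")

print(res)
```

The result has one row per name and columns such as B_vs_A_estimate and B_vs_A_p_value, although the column order differs from the layout shown in the question.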

Custom function with dplyr mutate or summarise for different levels within a factor?

Here is some example data:
library(car)
library(dplyr)
df1 <- mtcars %>%
group_by(cyl, gear) %>%
summarise(
newvar = sum(wt)
)
# A tibble: 8 x 3
# Groups: cyl [?]
cyl gear newvar
<dbl> <dbl> <dbl>
1 4 3 2.46
2 4 4 19.0
3 4 5 3.65
4 6 3 6.68
5 6 4 12.4
6 6 5 2.77
7 8 3 49.2
8 8 5 6.74
What if I then wanted to apply a custom function calculating the difference between the newvar values for cars with 3 or 5 gears for each level of cylinder?
df2 <- df1 %>% mutate(Diff = newvar[gear == "3"] - newvar[gear == "5"])
or with summarise?
df2 <- df1 %>% summarise(Diff = newvar[gear == "3"] - newvar[gear == "5"])
There must be a way to apply functions for different levels within different factors?
Any help appreciated!
Your example code is most of the way there. You can do:
df1 %>%
mutate(Diff = newvar[gear == "3"] - newvar[gear == "5"])
Or:
df1 %>%
summarise(Diff = newvar[gear == "3"] - newvar[gear == "5"])
Logical subsetting still works in mutate() and summarise() calls like with any other vector.
Note that this works because after your summarise() call in your example code, df1 is still grouped by cyl, otherwise you would need to do a group_by() call to create the correct grouping.
An option is to spread into 'wide' format and then do the subtraction:
library(tidyverse)
df1 %>%
filter(gear %in% c(3, 5) ) %>%
spread(gear, newvar) %>%
transmute(newvar = `3` - `5`)
# A tibble: 3 x 2
# Groups: cyl [3]
# cyl newvar
# <dbl> <dbl>
#1 4 -1.19
#2 6 3.90
#3 8 42.5
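Since spread() has been superseded in tidyr, the same reshaping can also be written with pivot_wider(). This is a sketch rather than a drop-in replacement: the "gear_" prefix and the column name Diff are choices made for this example, and the grouping is dropped in summarise() (dplyr 1.0.0 or later) to keep the later steps simple.

```r
library(dplyr)
library(tidyr)

df1 <- mtcars %>%
  group_by(cyl, gear) %>%
  summarise(newvar = sum(wt), .groups = "drop")

df2 <- df1 %>%
  filter(gear %in% c(3, 5)) %>%
  # One column per gear level: gear_3 and gear_5
  pivot_wider(names_from = gear, values_from = newvar, names_prefix = "gear_") %>%
  transmute(cyl, Diff = gear_3 - gear_5)

print(df2)
```

The prefix avoids having to quote non-syntactic column names such as `3` and `5` with backticks.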
