R - Merging columns and renaming values - r

To perform an ANOVA, I would like to merge these columns into a new variable called CompensationGroup. People who have been in "Compensationproject1" should be coded as 1, people who have been in "Compensationproject2" as 2, and so on.
library(tidyverse)
data %>%
mutate(Compensationproject2 = case_when(
Compensationproject2 == 1 ~ 2,
T ~ NA_real_
)) %>%
mutate(Compensationproject3 = case_when(
Compensationproject3 == 1 ~ 3,
T ~ NA_real_
)) %>%
unite("CompensationGroup",c(Compensationproject1,Compensationproject2,Compensationproject3),remove = F,na.rm = T) %>%
mutate(CompensationGroup = str_extract(CompensationGroup,'\\d'))

Here is an option that does not require pivoting. However, it does require several other tidyverse functions, namely mutate, case_when, unite and str_extract.
library(tidyverse)
df <- tribble(~id,~Comp1,~Comp2,~Comp3,
1,1,NA,NA,
2,NA,1,NA,
3,NA,NA,1)
df %>%
mutate(Comp2 = case_when(
Comp2 == 1 ~ 2,
T ~ NA_real_
)) %>%
mutate(Comp3 = case_when(
Comp3 == 1 ~ 3,
T ~ NA_real_
)) %>%
unite("group",c(Comp1,Comp2,Comp3),remove = F,na.rm = T) %>%
mutate(group = str_extract(group,'\\d'))
Good luck!
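As a shorter alternative to the unite() and str_extract() steps, the same grouping can probably be obtained with dplyr::coalesce(). This is a minimal sketch that assumes, as in the toy data above, that each membership column holds either 1 or NA; the name res is only illustrative.

```r
library(dplyr)
library(tibble)

# Same toy data as in the answer above
df <- tribble(~id, ~Comp1, ~Comp2, ~Comp3,
              1, 1, NA, NA,
              2, NA, 1, NA,
              3, NA, NA, 1)

# Multiply each 1/NA indicator by its group number, then keep the
# first non-missing value across the three columns.
res <- df %>%
  mutate(group = coalesce(Comp1 * 1, Comp2 * 2, Comp3 * 3))

res$group
```

If a person could belong to more than one project, coalesce() would keep only the lowest-numbered one, so this sketch is only appropriate when the groups are mutually exclusive.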

In the absence of a reproducible example I give you an example with toy data using the pivot_longer function from the tidyverse package.
library(tidyverse)
data <- tibble::tribble(
~Category, ~MeasureA, ~MeasureB, ~MeasureC, ~MeasureD,
1, 0.0930158825381708, 0.0138786762728455, 0.0659906858706141, 0.0677226540871513,
2, 0.103201113378404, 0.0149293889876177, 0.0644022070960172, 0.0605524137103402,
3, 0.12028743617311, 0.0209951412575897, 0.0598004419601402, 0.0584817396677436,
4, 0.0996307145670469, 0.016288452837476, 0.0624144782432749, 0.0538275028212587
)
data
# A tibble: 4 x 5
Category MeasureA MeasureB MeasureC MeasureD
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0.0930 0.0139 0.0660 0.0677
2 2 0.103 0.0149 0.0644 0.0606
3 3 0.120 0.0210 0.0598 0.0585
4 4 0.0996 0.0163 0.0624 0.0538
The following tells R to take columns 2:5 which hold the values, put the values into a column called Value and to put the label from the column names into a column called Measurement whilst removing the Measure label that is prefixed in the Measure columns.
data %>% pivot_longer(cols = 2:5, names_to = "Measurement", names_prefix = "Measure", values_to = "Value")
# A tibble: 16 x 3
Category Measurement Value
<dbl> <chr> <dbl>
1 1 A 0.0930
2 1 B 0.0139
3 1 C 0.0660
4 1 D 0.0677
5 2 A 0.103
6 2 B 0.0149
7 2 C 0.0644
8 2 D 0.0606
9 3 A 0.120
10 3 B 0.0210
11 3 C 0.0598
12 3 D 0.0585
13 4 A 0.0996
14 4 B 0.0163
15 4 C 0.0624
16 4 D 0.0538
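Applied to the question, the same pivot_longer call might look like the sketch below. The toy data, the helper column name member and the result name long are assumptions made for illustration; only the Compensationproject column names come from the question.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: 1 marks membership, NA marks no membership
data <- tibble::tibble(
  id = 1:3,
  Compensationproject1 = c(1, NA, NA),
  Compensationproject2 = c(NA, 1, NA),
  Compensationproject3 = c(NA, NA, 1)
)

long <- data %>%
  pivot_longer(cols = starts_with("Compensationproject"),
               names_to = "CompensationGroup",
               names_prefix = "Compensationproject",
               values_to = "member",
               values_drop_na = TRUE) %>%   # keep only the project each person was in
  mutate(CompensationGroup = as.integer(CompensationGroup)) %>%
  select(-member)

long
```

Dropping the NA rows leaves one row per person, with the project number taken from the column name.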

Found an answer myself:
data[, "CompensationGroup"] <- 1
for (i in 2:3) {
data[!is.na(data[[paste0("Compensationproject", i)]]), "CompensationGroup"] <- i
}
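A vectorised version of this loop can be sketched with dplyr::case_when(). The toy data and the name result are assumptions. Unlike the loop, which starts everyone at 1, this sketch leaves people with no project at all as NA.

```r
library(dplyr)

# Hypothetical data: 1 marks membership, NA marks no membership
data <- data.frame(
  id = 1:4,
  Compensationproject1 = c(1, NA, NA, 1),
  Compensationproject2 = c(NA, 1, NA, NA),
  Compensationproject3 = c(NA, NA, 1, NA)
)

result <- data %>%
  mutate(CompensationGroup = case_when(
    !is.na(Compensationproject1) ~ 1,
    !is.na(Compensationproject2) ~ 2,
    !is.na(Compensationproject3) ~ 3
  ))

result$CompensationGroup
```

For the ANOVA itself, the new column would usually be wrapped in factor(), since a numeric predictor is otherwise treated as a continuous covariate.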

Related

Apply t test over all columns of data frame separated by variable

I have a data frame like this:
X1 X2 X3 X4 X5 class
1 1 7 3 9 5 n
2 2 8 4 10 6 n
3 3 9 5 1 7 n
4 4 10 6 2 8 p
5 5 1 7 3 9 p
6 6 2 8 4 10 p
I would like to run a t-test on every column, comparing the groups defined by the variable class.
I know I can use a for loop for this, but I wonder if there is a better way.
m1 <- data.frame(matrix(1:10, nrow = 6, ncol = 5))
m1 <- data.frame(m1, class = c("n","n","n","p","p","p"))
# works for one column
t.test(X1 ~ class, data = m1)
# What I'm looking for
# t.test(X_i ~ class, data = m1)
library(dplyr)
library(tidyr)
library(broom)
df <- data.frame(
x1 = 0:9,
x2 = 10:19,
x3 = 20:29,
class = rep(c("a", "b"), each = 5)
)
# Conduct one-sample t-tests per class and column; store in nested data.frame
nested <- df %>%
group_by(class) %>%
summarise(across(everything(), ~ list(
t.test(.) %>%
tidy() %>%
select(estimate, statistic,
p.value, conf.low, conf.high)
)))
# Unnest and turn into long structure
long <- nested %>%
unnest(cols = starts_with("x"), names_sep = "_") %>%
pivot_longer(cols = starts_with("x"),
names_to = "quantity",
values_to = "value")
# Split variables into multiple columns and widen
long %>%
separate(col = quantity, into = c("variable", "quantity"), sep = "_") %>%
pivot_wider(names_from = "quantity")
#> # A tibble: 6 x 7
#> class variable estimate statistic p.value conf.low conf.high
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 a x1 2 2.83 0.0474 0.0368 3.96
#> 2 a x2 12 17.0 0.0000707 10.0 14.0
#> 3 a x3 22 31.1 0.00000636 20.0 24.0
#> 4 b x1 7 9.90 0.000584 5.04 8.96
#> 5 b x2 17 24.0 0.0000178 15.0 19.0
#> 6 b x3 27 38.2 0.00000281 25.0 29.0
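The code above runs a one-sample test within each class. If the goal is instead the two-sample comparison from the question (X_i ~ class for every column), a base R loop over the column names is one option. This is a sketch under the assumption that class has exactly two levels; value_cols and res are illustrative names.

```r
# Data from the question: 1:10 recycled into a 6 x 5 matrix
m1 <- data.frame(matrix(1:10, nrow = 6, ncol = 5))
m1$class <- c("n", "n", "n", "p", "p", "p")

value_cols <- setdiff(names(m1), "class")

# Build the formula X_i ~ class for each column and collect the results
res <- do.call(rbind, lapply(value_cols, function(col) {
  tt <- t.test(reformulate("class", response = col), data = m1)
  data.frame(variable = col,
             statistic = unname(tt$statistic),
             p.value = tt$p.value)
}))

res
```

reformulate() builds the formula from strings, so the column name can be supplied as a variable instead of being typed into each call.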

Nested group_by operation in dplyr: does the second call include the first call?

In my data below, I first want to group_by(study), get the mean of X for each unique study value, and subtract it from each X value in that study.
Second, while group_by(study) is still in effect, I want to further group_by(outcome) within each study, get the mean of X for each unique outcome value within each study, and subtract it from each X value in that outcome and study.
I'm using the following workaround, but it doesn't seem to achieve my goal, because the group_by(outcome) call appears to ignore the previous group_by(study).
Is there a way to achieve what I described above?
library(dplyr)
set.seed(0)
(data <- expand.grid(study = 1:2, outcome = rep(1:2,2)))
data$X <- rnorm(nrow(data))
(data <- arrange(data,study))
# study outcome X
#1 1 1 1.2629543
#2 1 2 1.3297993
#3 1 1 0.4146414
#4 1 2 -0.9285670
#5 2 1 -0.3262334
#6 2 2 1.2724293
#7 2 1 -1.5399500
#8 2 2 -0.2947204
data %>%
group_by(study) %>%
mutate(X_between_st = mean(X), X_within_st = X-X_between_st) %>%
group_by(outcome) %>%
mutate(X_between_ou = mean(X), X_within_ou = X-X_between_ou)
Yes, the second group_by() replaces the grouping set by the first, which can be checked with the group_vars() function.
library(dplyr)
data %>%
group_by(study) %>%
mutate(X_between_st = mean(X), X_within_st = X-X_between_st) %>%
group_by(outcome) %>%
group_vars()
#[1] "outcome"
As you can see at this stage the data is grouped only by outcome.
You can achieve your goal by including .add = TRUE in group_by which will add to the existing groups.
data %>%
group_by(study) %>%
mutate(X_between_st = mean(X), X_within_st = X-X_between_st) %>%
group_by(outcome, .add = TRUE) %>%
group_vars()
#[1] "study" "outcome"
So the full code becomes:
data %>%
group_by(study) %>%
mutate(X_between_st = mean(X), X_within_st = X-X_between_st) %>%
group_by(outcome, .add = TRUE) %>%
mutate(X_between_ou = mean(X), X_within_ou = X-X_between_ou)
# study outcome X X_between_st X_within_st X_between_ou X_within_ou
# <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 1 1.26 0.520 0.743 0.839 0.424
#2 1 2 1.33 0.520 0.810 0.201 1.13
#3 1 1 0.415 0.520 -0.105 0.839 -0.424
#4 1 2 -0.929 0.520 -1.45 0.201 -1.13
#5 2 1 -0.326 -0.222 -0.104 -0.933 0.607
#6 2 2 1.27 -0.222 1.49 0.489 0.784
#7 2 1 -1.54 -0.222 -1.32 -0.933 -0.607
#8 2 2 -0.295 -0.222 -0.0726 0.489 -0.784
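An equivalent way to get the nested grouping is to name both variables in the second call instead of relying on .add = TRUE. The sketch below uses small made-up numbers (not the rnorm data from the question) so the group means are easy to check by hand; res is an illustrative name.

```r
library(dplyr)

# Hypothetical data with easy-to-check means
data <- data.frame(study   = c(1, 1, 1, 1, 2, 2, 2, 2),
                   outcome = c(1, 2, 1, 2, 1, 2, 1, 2),
                   X       = c(1, 2, 3, 4, 5, 6, 7, 8))

res <- data %>%
  group_by(study) %>%
  mutate(X_between_st = mean(X), X_within_st = X - X_between_st) %>%
  group_by(study, outcome) %>%   # regroup explicitly by both variables
  mutate(X_between_ou = mean(X), X_within_ou = X - X_between_ou) %>%
  ungroup()

res
```

In study 1 the outcome means are 2 and 3 rather than the pooled outcome means across studies, which confirms that the second grouping is nested within study.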
We may also use cur_group() to inspect the current grouping:
data %>%
group_by(study) %>%
summarise(grps = names(cur_group())) %>%
slice(1) %>%
pull(grps)
#[1] "study"

How to pass multiple functions with multiple arguments using purrr::map?

I'm trying to evaluate multiple functions on my dataset using the purrr library. I tried splitting the data and then applying map with the list of functions, but I can't find the right way to pass my variables as arguments. What am I missing?
library(tidyverse)
df <- tibble(f1 = c('a','a','a','b','b','c','c','c','d','d'),
f2 = c('z','x','y','z','x','z','x','y','z','x'),
obs = sample(1:10, 10),
pred = sample(1:10,10))
#RMSE
#------------------------------------------------------------------------------
rmse <- function(obs, model){
n = length(obs)
diff = (model - obs)**2
rmse = sqrt(sum(diff)) / n
return(rmse)
}
#Mean Bias
#------------------------------------------------------------------------------
mean_bias <- function(obs, model){
n =length(obs)
diff=model-obs
mean_bias=sum(diff)/n
return(mean_bias)
}
#Mean error
#------------------------------------------------------------------------------
mean_error <- function(obs, model){
n =length(obs)
abs=abs(model-obs)
mean_error=sum(abs)/n
return(mean_error)
}
mod_eval <- list(rmse, mean_bias,mean_error)
test <- df%>%
split(list(c(.$f1,.$f2)))%>%
invoke_map(.f=mod_eval)##????
I expect to get something like this:
f1 f2 rmse mean_bias mean_error
1 a x 17 11 5
2 b x 17 11 5
3 c x 17 11 5
4 d x 17 11 5
5 a y 17 11 5
6 b y 17 11 5
You can pass a named list of functions and use map to apply each function to each combination.
library(dplyr)
library(purrr)
mod_eval <- lst(rmse, mean_bias,mean_error)
df %>%
split(.$f1) %>%
map_df(function(x) map_dbl(mod_eval, ~.x(x$obs, x$pred)), .id = 'f1')
# f1 rmse mean_bias mean_error
# <chr> <dbl> <dbl> <dbl>
#1 a 2.33 -1.67 3.67
#2 b 3.91 5.5 5.5
#3 c 2.92 -2.33 4.33
#4 d 2.5 0.5 3.5
We could use tidyverse approaches
library(dplyr)
library(tidyr)
mod_eval <- dplyr::lst(rmse, mean_bias, mean_error)
df %>%
nest_by(f1) %>%
crossing(mod_eval) %>%
mutate(nm1 = names(mod_eval)) %>%
rowwise %>%
transmute(f1, nm1, out = list(mod_eval(data$obs, data$pred))) %>%
unnest(out) %>%
pivot_wider(names_from = nm1, values_from = out)
-output
f1 rmse mean_bias mean_error
<chr> <dbl> <dbl> <dbl>
1 a 1.80 0.333 3
2 b 3.35 -1.5 4.5
3 c 3.09 4 4.67
4 d 4.53 -5 5
If it is grouped by 'f1' and 'f2':
df %>%
nest_by(f1, f2) %>%
crossing(mod_eval) %>%
mutate(nm1 = names(mod_eval)) %>%
rowwise %>%
transmute(f1, f2, nm1, out = list(mod_eval(data$obs, data$pred))) %>%
unnest(out) %>%
pivot_wider(names_from = nm1, values_from = out)
-output
# A tibble: 10 x 5
f1 f2 rmse mean_bias mean_error
<chr> <chr> <dbl> <dbl> <dbl>
1 a x 4 -4 4
2 a y 2 2 2
3 a z 3 3 3
4 b x 3 3 3
5 b z 6 -6 6
6 c x 1 -1 1
7 c y 7 7 7
8 c z 6 6 6
9 d x 9 -9 9
10 d z 1 -1 1
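When the set of metrics is small and fixed, a plain group_by() and summarise() may be enough. The sketch below makes two assumptions: the data are made-up numbers chosen so the results can be checked by hand, and rmse is written in its usual form sqrt(mean((model - obs)^2)), which differs from the question's version (that one divides by n after taking the square root).

```r
library(dplyr)

# Metric functions; rmse uses the conventional definition (an assumption,
# not the formula from the question)
rmse       <- function(obs, model) sqrt(mean((model - obs)^2))
mean_bias  <- function(obs, model) mean(model - obs)
mean_error <- function(obs, model) mean(abs(model - obs))

# Hypothetical data
df <- data.frame(f1   = c("a", "a", "a", "a", "b", "b"),
                 f2   = c("x", "x", "y", "y", "x", "x"),
                 obs  = c(1, 2, 3, 4, 5, 6),
                 pred = c(2, 4, 3, 2, 5, 8))

res <- df %>%
  group_by(f1, f2) %>%
  summarise(rmse       = rmse(obs, pred),
            mean_bias  = mean_bias(obs, pred),
            mean_error = mean_error(obs, pred),
            .groups = "drop")

res
```

This gives one row per f1 and f2 combination, matching the layout of the expected output in the question.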

t-test of one group versus many groups in tidyverse

I have the following tibble
test_tbl <- tibble(name = rep(c("John", "Allan", "George", "Peter", "Paul"), each = 12),
category = rep(rep(LETTERS[1:4], each = 3), 5),
replicate = rep(1:3, 20),
value = sample.int(n = 1e5, size = 60, replace = T))
# A tibble: 60 x 4
name category replicate value
<chr> <chr> <int> <int>
1 John A 1 71257
2 John A 2 98887
3 John A 3 87354
4 John B 1 25352
5 John B 2 69913
6 John B 3 43086
7 John C 1 24957
8 John C 2 33928
9 John C 3 79854
10 John D 1 32842
11 John D 2 19156
12 John D 3 50283
13 Allan A 1 98188
14 Allan A 2 26208
15 Allan A 3 69329
16 Allan B 1 32696
17 Allan B 2 81240
18 Allan B 3 54689
19 Allan C 1 77044
20 Allan C 2 97776
# … with 40 more rows
I want to group_by(name, category) and perform 3 t.test calls, comparing category B, C and D with category A.
I would like to store the estimate and p.value from the output. The expected result is something like this:
# A tibble: 5 x 7
name B_vs_A_estimate B_vs_A_p_value C_vs_A_estimate C_vs_A_p_value D_vs_A_estimate D_vs_A_p_value
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 John -0.578 0.486 0.198 0.309 0.631 0.171
2 Allan 0.140 0.644 0.728 0.283 0.980 0.485
3 George -0.778 0.320 -0.424 0.391 -0.154 0.589
4 Peter -0.435 0.470 -0.156 0.722 0.315 0.0140
5 Paul 0.590 0.0150 -0.473 0.475 0.681 0.407
I would prefer a solution using tidyverse and/or broom.
There are many ways to achieve the desired output, but this one is perhaps the most intuitive and is easy to debug (you can put a browser() call anywhere):
test_tbl %>%
group_by(name) %>%
do({
sub_tbl <- .
expand.grid(g1="A", g2=c("B", "C", "D"), stringsAsFactors = FALSE) %>%
mutate(test=as.character(glue::glue("{g1}_vs_{g2}"))) %>%
rowwise() %>%
do({
gs <- .
t_res <- t.test(sub_tbl %>% filter(category == gs$g1) %>% pull(value),
sub_tbl %>% filter(category == gs$g2) %>% pull(value))
data.frame(test=gs$test, estimate=t_res$statistic, p_value=t_res$p.value,
stringsAsFactors = FALSE)
})
}) %>%
ungroup() %>%
gather(key="statistic", value="val", -name, -test) %>%
mutate(test_statistic = paste(test, statistic, sep = "_")) %>%
select(-test, -statistic) %>%
spread(key="test_statistic", value="val")
Result
# A tibble: 5 x 7
name A_vs_B_estimate A_vs_B_p_value A_vs_C_estimate A_vs_C_p_value A_vs_D_estimate A_vs_D_p_value
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Allan -0.270 0.803 -1.03 0.396 1.55 0.250
2 George 0.201 0.855 0.221 0.838 1.07 0.380
3 John -1.59 0.249 0.0218 0.984 -0.410 0.704
4 Paul 0.116 0.918 -1.62 0.215 -1.53 0.212
5 Peter 0.471 0.664 0.551 0.611 0.466 0.680
It groups the records by name and then applies a function (do #1). That function saves the sub-data-frame in sub_tbl, expands all the test cases (expand.grid) and creates a test name from the two letters combined. Then, for each combination, it applies a second function that runs the t-test (do #2). That anonymous function performs the test between group 1 (g1) and group 2 (g2) and returns a data frame with the results. Note that the column named estimate stores t_res$statistic, i.e. the t statistic.
The second part rearranges the columns to produce the final output.
test_tbl %>%
dplyr::group_by(name) %>%
dplyr::summarise(estimate_AB =
t.test(value[category == "A"| category == "B"] ~ category[category == "A" | category == "B"]) %>% (function(x){x$estimate[1] - x$estimate[2]}),
pvalue_AB = t.test(value[category == "A"| category == "B"] ~ category[category == "A" | category == "B"]) %>% (function(x){x$p.value})
)
Here is what I did to test A against B within each name group. I think you could extend my approach, or try to incorporate the code from the first solution.
EDIT: cleaner code
map(unique(test_tbl$name),function(nm){test_tbl %>% filter(name == nm)}) %>%
map2(unique(test_tbl$name),function(dat,nm){
map(LETTERS[2:4],function(cat){
dat %>%
filter(category == "A") %>%
pull %>%
t.test(dat %>% filter(category == cat) %>% pull)
}) %>%
map_dfr(broom::glance) %>%
select(statistic,p.value) %>%
mutate(
name = nm,
cross_cat = paste0(LETTERS[2:4]," versus A")
)
}) %>%
{do.call(rbind,.)}
We can use
library(dplyr)
library(purrr)
library(stringr)
library(tidyr)
test_tbl %>%
split(.$name) %>%
map_dfr(~ {
Avalue <- .x$value[.x$category == 'A']
.x %>%
filter(category != 'A') %>%
group_by(category) %>%
summarise(out = t.test(value, Avalue)$p.value) %>%
mutate(category = str_c(category, '_vs_A_p_value'))}, .id = 'name') %>%
pivot_wider(names_from = category, values_from = out)
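The last approach returns only p-values. One way to also keep an estimate is sketched below. It assumes that "estimate" means the difference in means (category minus A), which the question does not define; compare_to_A and res are hypothetical names, and the data are a smaller random stand-in for test_tbl.

```r
library(dplyr)
library(tidyr)

set.seed(42)
# Smaller random stand-in for the tibble in the question
test_tbl <- tibble::tibble(
  name = rep(c("John", "Allan"), each = 12),
  category = rep(rep(LETTERS[1:4], each = 3), 2),
  replicate = rep(1:3, 8),
  value = rnorm(24)
)

# Hypothetical helper: compare every non-A category with A for one person
compare_to_A <- function(d) {
  a <- d$value[d$category == "A"]
  others <- setdiff(unique(d$category), "A")
  bind_rows(lapply(others, function(ct) {
    tt <- t.test(d$value[d$category == ct], a)
    data.frame(comparison = paste0(ct, "_vs_A"),
               estimate = unname(tt$estimate[1] - tt$estimate[2]),
               p_value = tt$p.value)
  }))
}

res <- test_tbl %>%
  split(.$name) %>%
  lapply(compare_to_A) %>%
  bind_rows(.id = "name") %>%
  pivot_wider(names_from = comparison,
              values_from = c(estimate, p_value),
              names_glue = "{comparison}_{.value}")

names(res)
```

The names_glue argument puts the comparison label first, so the columns follow the B_vs_A_estimate, B_vs_A_p_value pattern requested in the question.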

Winners within pairs; or vector-valued group_by mutate?

I'm trying to assess which unit in a pair is the "winner". group_by() %>% mutate() is close to the right thing, but it's not quite there. In particular,
dat %>% group_by(pair) %>% mutate(winner = ifelse(score[1] > score[2], c(1, 0), c(0, 1))) doesn't work.
The code below does work, but it is clunky because it needs an intermediate summary data frame. Can we improve this?
library(tidyverse)
set.seed(343)
# units within pairs get scores
dat <-
tibble(pair = rep(1:3, each = 2),
unit = rep(1:2, 3),
score = rnorm(6))
# figure out who won in each pair
summary_df <-
dat %>%
group_by(pair) %>%
summarize(winner = which.max(score))
# merge back and determine whether each unit won
dat <-
left_join(dat, summary_df, "pair") %>%
mutate(won = as.numeric(winner == unit))
dat
#> # A tibble: 6 x 5
#> pair unit score winner won
#> <int> <int> <dbl> <int> <dbl>
#> 1 1 1 -1.40 2 0
#> 2 1 2 0.523 2 1
#> 3 2 1 0.142 1 1
#> 4 2 2 -0.847 1 0
#> 5 3 1 -0.412 1 1
#> 6 3 2 -1.47 1 0
Created on 2018-09-26 by the reprex package (v0.2.0).
Possibly related: Weird group_by + mutate + which.max behavior
You could do:
dat %>%
group_by(pair) %>%
mutate(won = score == max(score),
winner = unit[won == TRUE])
# A tibble: 6 x 5
# Groups: pair [3]
pair unit score won winner
<int> <int> <dbl> <lgl> <int>
1 1 1 -1.40 FALSE 2
2 1 2 0.523 TRUE 2
3 2 1 0.142 TRUE 1
4 2 2 -0.847 FALSE 1
5 3 1 -0.412 TRUE 1
6 3 2 -1.47 FALSE 1
Using rank:
dat %>% group_by(pair) %>% mutate(won = rank(score) - 1)
More for fun (and slightly faster), using the outcome of the comparison (score[1] > score[2]) to index a vector with 'won alternatives' :
dat %>% group_by(pair) %>%
mutate(won = c(0, 1, 0)[1:2 + (score[1] > score[2])])
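The ifelse() attempt in the question fails because ifelse() returns a result as long as its test, which here has length 1, so only the first element of c(1, 0) or c(0, 1) survives and is recycled across the pair. A plain if/else returns the whole vector, and which.max() gives a variant that does not depend on pairs having exactly two units. The sketch below uses the rounded scores from the printed output above; res, won_if and won_max are illustrative names.

```r
library(dplyr)

# Rounded scores copied from the output shown in the question
dat <- data.frame(pair = rep(1:3, each = 2),
                  unit = rep(1:2, 3),
                  score = c(-1.40, 0.523, 0.142, -0.847, -0.412, -1.47))

res <- dat %>%
  group_by(pair) %>%
  mutate(
    # if/else returns a full length-2 vector per pair
    won_if = if (score[1] > score[2]) c(1, 0) else c(0, 1),
    # flag the row holding the group maximum; works for groups of any size
    won_max = as.numeric(row_number() == which.max(score))
  ) %>%
  ungroup()

res
```

With tied scores, which.max() picks the first maximum, so only one unit per group is flagged as the winner.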
