I am looking for a solution to my problem. So far I can only solve it by rearranging the columns manually.
Example code:
library(dplyr)
set.seed(1)
Data <- data.frame(
W = sample(1:10),
X = sample(1:10),
Y = sample(c("yes", "no"), 10, replace = TRUE),
Z = sample(c("cat", "dog"), 10, replace = TRUE)
)
#
summarized <- Data %>% group_by(Z) %>% summarise_if(is.numeric,funs(mean,median),na.rm=T)
print(Data)
I want the output to look like the one below: each function applied to the first column, then each function applied to the second column, and so on. My code does it the other way round.
Of course I could rearrange the columns by hand, but that is not what data science is about. I have hundreds of columns and want to apply multiple functions.
This is what I want:
summarized <- summarized[,c(1,2,4,3,5)] #best solution yet
Is there an argument I am missing? I bet there is an easy solution, or another function that does the job.
Guys, thx in advance!
One option would be to post-process with suitable select helpers:
library(dplyr)
summarized %>%
select(Z, starts_with('W'), everything())
# A tibble: 2 x 5
# Z W_mean W_median X_mean X_median
# <fct> <dbl> <dbl> <dbl> <dbl>
#1 cat 5.25 5.5 3.75 3.5
#2 dog 5.67 5.5 6.67 7
If there are hundreds of columns, one approach is to strip the suffix from the column names and order by what remains:
library(stringr)
summarized %>%
select(Z, order(str_remove(names(.), "_.*")))
# A tibble: 2 x 5
# Z W_mean W_median X_mean X_median
# <fct> <dbl> <dbl> <dbl> <dbl>
#1 cat 5.25 5.5 3.75 3.5
#2 dog 5.67 5.5 6.67 7
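To see why the order() call works, it can help to look at the name vector in isolation. The sketch below is only an illustration: it uses base R's sub() in place of str_remove() to stay dependency-free, and nms is a hypothetical copy of the column names of summarized.

```r
# Hypothetical copy of the column names produced by summarise_if() above
nms <- c("Z", "W_mean", "X_mean", "W_median", "X_median")

# Drop everything from the first underscore on, leaving the source column
prefix <- sub("_.*", "", nms)

# order() leaves ties in their original order, so columns that share a
# prefix keep the mean/median order they already had
idx <- order(prefix)

# Names in the order that select() will receive them
nms[idx]
```

The position of Z inside idx does not matter in the answer above, because select(Z, ...) has already placed Z first.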
You can use starts_with() to select the columns, instead of by number.
library(dplyr)
set.seed(1)
Data <- data.frame(
W = sample(1:10),
X = sample(1:10),
Y = sample(c("yes", "no"), 10, replace = TRUE),
Z = sample(c("cat", "dog"), 10, replace = TRUE)
)
summarized <-
Data %>%
group_by(Z) %>%
summarise_if(is.numeric,funs(mean,median),na.rm=T) %>%
select(Z, starts_with("W_"), starts_with("X_"))
summarized
#> # A tibble: 2 x 5
#> Z W_mean W_median X_mean X_median
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 cat 5.25 5.5 3.75 3.5
#> 2 dog 5.67 5.5 6.67 7
Created on 2019-12-09 by the reprex package (v0.3.0)
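With dplyr 1.0.0 or later, another possibility is across(), which supersedes summarise_if() and funs(). The sketch below assumes that version and reuses the example data from the question; across() emits the results column by column, so no reordering step should be needed.

```r
library(dplyr)

set.seed(1)
Data <- data.frame(
  W = sample(1:10),
  X = sample(1:10),
  Y = sample(c("yes", "no"), 10, replace = TRUE),
  Z = sample(c("cat", "dog"), 10, replace = TRUE)
)

summarized <- Data %>%
  group_by(Z) %>%
  summarise(
    across(
      where(is.numeric),                      # W and X in this example
      list(
        mean   = ~ mean(.x, na.rm = TRUE),
        median = ~ median(.x, na.rm = TRUE)
      ),
      .names = "{.col}_{.fn}"                 # e.g. W_mean, W_median
    )
  )

# Column order: grouping column first, then all functions per input column
names(summarized)
```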
I recently understood how to access column names inside a user-defined function: How to access a column name in a user defined function with dplyr?
However, now I also need to access the column names within the operations that are being carried out. For example I would like to do this:
samp_df <- tibble(var1 = c('a', 'b', 'c'),
var_in_df = c(3,7,9))
calculateSummaries <- function(df, variable){
df <- df %>%
mutate("mean_of_{{variable}}" := mean({{variable}}),
"sd_of_{{variable}}" := sd({{variable}}),
"sd_plus_mean_of_{{variable}}" := ("mean_of_{{variable}}" + "sd_of_{{variable}}")
)
}
df_result <- calculateSummaries(samp_df, var_in_df)
Of course I could do:
"sd_plus_mean_of_{{variable}}" := mean({{variable}}) + sd({{variable}})
But in practice, with the real data, this won't be practical.
Does anyone know how to do this?
This case is indeed a little bit tricky. I think we have to construct the names first and then use !! sym() to evaluate the strings as objects.
library(dplyr)
samp_df <- tibble(var1 = c('a', 'b', 'c'),
var_in_df = c(3,7,9))
calculateSummaries <- function(df, variable){
var_nm <- deparse(substitute(variable))
mean_var_nm <- paste0("mean_of_", var_nm)
sd_var_nm <- paste0("sd_of_", var_nm)
df %>%
mutate("mean_of_{{variable}}" := mean({{variable}}),
"sd_of_{{variable}}" := sd({{variable}}),
"sd_plus_mean_of_{{variable}}" := !! sym(mean_var_nm) + !! sym(sd_var_nm)
)
}
calculateSummaries(samp_df, var_in_df)
#> # A tibble: 3 x 5
#> var1 var_in_df mean_of_var_in_df sd_of_var_in_df sd_plus_mean_of_var_in_df
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 a 3 6.33 3.06 9.39
#> 2 b 7 6.33 3.06 9.39
#> 3 c 9 6.33 3.06 9.39
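Stripped of the naming logic, the !! sym() step can be looked at on its own. The sketch below is only an illustration; df_small, col_name, and the doubled column are hypothetical names that do not appear in the question.

```r
library(dplyr)
library(rlang)

df_small <- tibble(a = c(1, 2, 3))

# The column name is only known as a string
col_name <- "a"

# sym() turns the string into a symbol, !! injects that symbol into the
# expression, so mutate() evaluates it as the column `a`
res <- df_small %>%
  mutate(doubled = (!!sym(col_name)) * 2)

res
```

The parentheses around !!sym(col_name) are not strictly required in tidy-eval contexts, but they make the intended grouping explicit.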
An alternative way is using across(), but we still have to construct the variable names.
calculateSummaries <- function(df, variable){
df %>%
mutate("mean_of_{{variable}}" := mean({{variable}}),
"sd_of_{{variable}}" := sd({{variable}}),
across(c({{ variable }}),
list(sd_plus_mean_of = ~ get(paste0("mean_of_", cur_column())) + get(paste0("sd_of_", cur_column())))
)
)
}
calculateSummaries(samp_df, var_in_df)
#> # A tibble: 3 x 5
#> var1 var_in_df mean_of_var_in_df sd_of_var_in_df var_in_df_sd_plus_mean_of
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 a 3 6.33 3.06 9.39
#> 2 b 7 6.33 3.06 9.39
#> 3 c 9 6.33 3.06 9.39
Here is a final way inspired by Lionel Henry's answer to this question. We can use rlang::englue() to construct names and use those names with the .data[[...]] pronoun.
calculateSummaries <- function(df, variable){
mean_var_nm <- rlang::englue("mean_of_{{ variable }}")
sd_var_nm <- rlang::englue("sd_of_{{ variable }}")
df %>%
mutate("mean_of_{{ variable }}" := mean({{ variable }}),
"sd_of_{{ variable }}" := sd({{ variable }}),
"sd_plus_mean_of_{{ variable }}" := .data[[mean_var_nm]] + .data[[sd_var_nm]]
)
}
calculateSummaries(samp_df, var_in_df)
#> # A tibble: 3 x 5
#> var1 var_in_df mean_of_var_in_df sd_of_var_in_df sd_plus_mean_of_var_in_df
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 a 3 6.33 3.06 9.39
#> 2 b 7 6.33 3.06 9.39
#> 3 c 9 6.33 3.06 9.39
Created on 2022-10-13 by the reprex package (v2.0.1)
According to this tidyverse blog post, glue strings are only supported as result names, which IMHO means only on the LHS.
Besides the options offered by @TimTeaFan, another option would be to use across to compute all desired values and name the columns using the .names argument:
library(dplyr)
calculateSummaries1 <- function(df, variable) {
df <- df %>%
mutate(across({{ variable }},
.fns = list(
mean = mean,
sd = sd,
sd_plus_mean = ~ mean(.x) + sd(.x)
),
.names = "{.fn}_of_{.col}"
))
df
}
calculateSummaries1(samp_df, var_in_df)
#> # A tibble: 3 × 5
#> var1 var_in_df mean_of_var_in_df sd_of_var_in_df sd_plus_mean_of_var_in_df
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 a 3 6.33 3.06 9.39
#> 2 b 7 6.33 3.06 9.39
#> 3 c 9 6.33 3.06 9.39
A second option would be to use helper variable names for the mean and the sd, which avoids using glue syntax on the RHS but requires an additional rename step:
calculateSummaries2 <- function(df, variable) {
df <- df %>%
mutate(
mean = mean({{ variable }}),
sd = sd({{ variable }}),
"sd_plus_mean_of_{{variable}}" := mean + sd
) |>
rename("mean_of_{{variable}}" := mean, "sd_of_{{variable}}" := sd)
df
}
calculateSummaries2(samp_df, var_in_df)
#> # A tibble: 3 × 5
#> var1 var_in_df mean_of_var_in_df sd_of_var_in_df sd_plus_mean_of_var_in_df
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 a 3 6.33 3.06 9.39
#> 2 b 7 6.33 3.06 9.39
#> 3 c 9 6.33 3.06 9.39
I have a tibble and I want to create several summaries of the same column, specifically the first, second and third quartiles.
To do it, I create a named list of functions and that works fine.
library("tidyverse")
set.seed(1234)
df <- tibble(x = rnorm(100))
df %>%
summarise(
across(x,
list(
Q1 = ~ quantile(., 1 / 4),
Q2 = ~ quantile(., 2 / 4),
Q3 = ~ quantile(., 3 / 4)
),
.names = "{.fn}"
)
)
#> # A tibble: 1 × 3
#> Q1 Q2 Q3
#> <dbl> <dbl> <dbl>
#> 1 -0.895 -0.385 0.471
Can I achieve this by specifying the list of probabilities to pass to quantile? So that I save myself typing and more importantly avoid hard-coding the arguments to pass to the aggregating function.
The following doesn't work because it creates one row per probability rather than one column.
df %>%
summarise(
across(x, quantile, 1:3 / 4)
)
#> # A tibble: 3 × 1
#> x
#> <dbl>
#> 1 -0.895
#> 2 -0.385
#> 3 0.471
You're almost there:
df <- tibble(x = rnorm(100))
df %>%
summarise(
across(x,
map(1:3, ~partial(quantile, probs=./4)),
.names = "Q{.fn}"
)
)
# A tibble: 1 x 3
Q1 Q2 Q3
<dbl> <dbl> <dbl>
1 -0.579 0.0815 0.475
If you define the quantiles like this:
Q <- c(0.25, 0.5, 0.75)
Then the following code will produce columns of the appropriate quantiles with sensible labels:
df %>%
summarise(
across(x,
setNames( lapply(Q,
function(x) { f <- ~quantile(., b); f[2][[1]][[3]] <- x; f }),
paste("Q", round(100 * Q), sep = "_")),
.names = "{.fn}"
)
)
#> # A tibble: 1 x 3
#> Q_25 Q_50 Q_75
#> <dbl> <dbl> <dbl>
#> 1 -0.895 -0.385 0.471
Created on 2022-06-29 by the reprex package (v2.0.1)
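A variant of the same idea that avoids editing the formula by hand is to build one closure per probability. This is only a sketch, assuming dplyr 1.0.0 or later; probs and quantile_fns are hypothetical names, and the names of probs are reused as column names.

```r
library(dplyr)

set.seed(1234)
df <- tibble(x = rnorm(100))

# Named probabilities; the names become the output column names
probs <- c(Q1 = 0.25, Q2 = 0.50, Q3 = 0.75)

# One function per probability; force() pins down p inside each closure
quantile_fns <- lapply(probs, function(p) {
  force(p)
  function(x) quantile(x, probs = p, names = FALSE)
})

# across() accepts a named list of functions, one output column per entry
res <- df %>%
  summarise(across(x, quantile_fns, .names = "{.fn}"))

res
```

Changing the set of quantiles then only means changing probs.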
I have a function that generates a dataframe with 2 cols (X and Y).
I want to use map_dfc, but I would like to change the suffixes "...1", "...2" and so on, which appear because the column names are the same.
I would like something like (X_df1, Y_df1, X_df2, Y_df2, ...). Is there a suffix parameter? I've read the documentation and couldn't find one.
I don't want to use map_dfr because I need the dataframe to be wide.
example_function <- function(n1,n2){
tibble(X = n1+n2,
Y = n1*n2)
}
values <- tibble(n1 = c(1,2),
n2 = c(5,6))
map2_dfc(values$n1, values$n2, example_function)
gives me
A tibble: 1 x 4
X...1 Y...2 X...3 Y...4
<dbl> <dbl> <dbl> <dbl>
1 6 5 8 12
And I want
A tibble: 1 x 4
X_df1 Y_df1 X_df2 Y_df2
<dbl> <dbl> <dbl> <dbl>
1 6 5 8 12
Thanks!
If we don't want to change the function, we can rename before binding the columns: use pmap to loop over the rows of the data and apply the function (example_function), loop over the resulting list with imap, rename all the columns of each tibble with its list index, and then use bind_cols.
library(dplyr)
library(purrr)
library(stringr)
pmap(values, example_function) %>%
imap(~ {nm1 <- str_c('_df', .y)
rename_with(.x, ~ str_c(., nm1), everything())
}) %>%
bind_cols
-output
# A tibble: 1 × 4
X_df1 Y_df1 X_df2 Y_df2
<dbl> <dbl> <dbl> <dbl>
1 6 5 8 12
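The same steps can be spelled out one at a time, which may make the role of the list index easier to follow. This is a sketch using the example from the question, with map2() in place of pmap(); results, renamed, and wide are hypothetical intermediate names.

```r
library(dplyr)
library(purrr)

example_function <- function(n1, n2) {
  tibble(X = n1 + n2,
         Y = n1 * n2)
}
values <- tibble(n1 = c(1, 2),
                 n2 = c(5, 6))

# Step 1: one tibble per (n1, n2) pair, kept as a plain list
results <- map2(values$n1, values$n2, example_function)

# Step 2: imap() passes each tibble together with its position in the list,
# which is used here as the suffix
renamed <- imap(results, function(res, i) {
  rename_with(res, function(nm) paste0(nm, "_df", i))
})

# Step 3: the names are now unique, so binding needs no name repair
wide <- bind_cols(renamed)
wide
```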
Or you could just build the new names first and apply them after you call map2_dfc():
library(purrr)
library(tibble)
example_function <- function(n1,n2){
tibble(X = n1+n2,
Y = n1*n2)
}
values <- tibble(n1 = c(1,2),
n2 = c(5,6))
new_names <- lapply(seq_len(nrow(values)), function(x) paste0(c("X", "Y"), "_df", x)) %>% # one X/Y pair per row of values, i.e. per call
unlist()
map2_dfc(values$n1, values$n2, example_function) %>%
setNames(new_names)
#> New names:
#> * X -> X...1
#> * Y -> Y...2
#> * X -> X...3
#> * Y -> Y...4
#> # A tibble: 1 x 4
#> X_df1 Y_df1 X_df2 Y_df2
#> <dbl> <dbl> <dbl> <dbl>
#> 1 6 5 8 12
Created on 2022-04-08 by the reprex package (v2.0.1)
I have a tibble with a number of variables collected over time. A very simplified version of the tibble looks like this.
df = tribble(
~id, ~varA.t1, ~varA.t2, ~varB.t1, ~varB.t2,
'row_1', 5, 10, 2, 4,
'row_2', 20, 50, 4, 6
)
I want to systematically create a new set of variables varC so that varC.t# = varA.t# / varB.t# where # is 1, 2, 3, etc. (similarly to the way column names are setup in the tibble above).
How do I use something along the lines of mutate or across to do this?
You can do something like this with mutate(across(...)); there is probably a shorter way to rename the columns, though.
df %>%
mutate(across(.cols = c(varA.t1, varA.t2),
.fns = ~ .x / get(glue::glue(str_replace(cur_column(), "varA", "varB"))),
.names = "V_{.col}")) %>%
rename_with(~str_replace(., "V_varA", "varC"), starts_with("V_"))
# A tibble: 2 x 7
id varA.t1 varA.t2 varB.t1 varB.t2 varC.t1 varC.t2
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 row_1 5 10 2 4 2.5 2.5
2 row_2 20 50 4 6 5 8.33
If there is a long time series you can also create a vector for .cols beforehand.
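As a sketch of that last suggestion, the vector of varA columns could be collected by pattern and passed through all_of(). The name varA_cols is hypothetical, and base R's sub() stands in for str_replace() to keep the example short; the get() call is taken from the code above.

```r
library(dplyr)

df <- tribble(
  ~id,     ~varA.t1, ~varA.t2, ~varB.t1, ~varB.t2,
  "row_1",        5,       10,        2,        4,
  "row_2",       20,       50,        4,        6
)

# Collect every varA column once, however many time points there are
varA_cols <- grep("^varA\\.", names(df), value = TRUE)

res <- df %>%
  mutate(across(
    all_of(varA_cols),
    # look up the matching varB column by name for the current varA column
    ~ .x / get(sub("varA", "varB", cur_column())),
    .names = "V_{.col}"
  )) %>%
  rename_with(~ sub("V_varA", "varC", .x), starts_with("V_"))

res
```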
I have a package on GitHub called {dplyover} which aims to solve this kind of problem in a way similar to dplyr::across.
The function is called across2. It lets you define two sets of columns to which you can apply one or several functions. The .names argument supports two glue specifications: {pre} and {suf}. They extract the shared prefix and suffix of the variable names. This makes it easy to put nice names on our output variables.
The function has one caveat: it is not performant when applied to highly grouped data (there is a vignette with benchmarks).
library(dplyr)
library(dplyover) # https://github.com/TimTeaFan/dplyover
df = tribble(
~id, ~varA.t1, ~varA.t2, ~varB.t1, ~varB.t2,
'row_1', 5, 10, 2, 4,
'row_2', 20, 50, 4, 6
)
df %>%
mutate(across2(starts_with("varA"),
starts_with("varB"),
~ .x / .y,
.names = "{pre}C.{suf}"))
#> # A tibble: 2 x 7
#> id varA.t1 varA.t2 varB.t1 varB.t2 varC.t1 varC.t2
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 row_1 5 10 2 4 2.5 2.5
#> 2 row_2 20 50 4 6 5 8.33
Created on 2021-04-10 by the reprex package (v0.3.0)
For such cases I find using base R easy and efficient.
varAcols <- sort(grep('varA', names(df), value = TRUE))
varBcols <- sort(grep('varB', names(df), value = TRUE))
df[sub('A', 'C', varAcols)] <- df[varAcols]/df[varBcols]
# id varA.t1 varA.t2 varB.t1 varB.t2 varC.t1 varC.t2
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 row_1 5 10 2 4 2.5 2.5
#2 row_2 20 50 4 6 5 8.33
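Because this approach pairs columns purely by position after sorting, a small guard can confirm that the time suffixes line up before dividing. The sketch below adds such a check as an assumption of mine, not part of the answer, and uses a plain data.frame with the same values.

```r
df <- data.frame(
  id = c("row_1", "row_2"),
  varA.t1 = c(5, 20), varA.t2 = c(10, 50),
  varB.t1 = c(2, 4),  varB.t2 = c(4, 6)
)

varAcols <- sort(grep("^varA", names(df), value = TRUE))
varBcols <- sort(grep("^varB", names(df), value = TRUE))

# Guard: after removing the prefixes, both vectors must hold the same suffixes
# in the same order, otherwise the division would pair the wrong columns
stopifnot(identical(sub("^varA", "", varAcols), sub("^varB", "", varBcols)))

# Element-wise division of two equally shaped blocks of columns
df[sub("^varA", "varC", varAcols)] <- df[varAcols] / df[varBcols]
df
```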
Another way to do this with some customization is
Initial setup
library(dplyr)
library(purrr)
library(stringr)
df = tribble(
~id, ~varA.t1, ~varA.t2, ~varB.t1, ~varB.t2,
'row_1', 5, 10, 2, 4,
'row_2', 20, 50, 4, 6
)
# A function that takes a formula string, parses it, and corrects the column name
operation_function <- function(df, formula) {
# Extract the column name from the formula
new_column_name <- str_extract(formula, "^.+=")
new_column_name <- trimws(gsub("=", "", new_column_name))
# Process the df
df %>%
# parse the formula - this results in a new column whose name is the formula text
mutate(!!rlang::parse_expr(formula)) %>%
# rename the new created column with the correct column name
rename(!!new_column_name := last_col())
}
Note: I think there should be a more efficient way to evaluate the formula so that the new column gets the proper name directly, but I couldn't figure it out right now. Ideas from others are welcome.
Prepare the formulas to be applied to the data. In this case it is simple.
For more complicated formulas you may want to do it a little differently.
# Prepare the formula
base_formula <- c("varC.t# = varA.t# / varB.t#")
replacement_list <- c(1, 2)
list_formula <- map(replacement_list, .f = gsub,
pattern = "#", x = base_formula)
list_formula
#> [[1]]
#> [1] "varC.t1 = varA.t1 / varB.t1"
#>
#> [[2]]
#> [1] "varC.t2 = varA.t2 / varB.t2"
Finally process the data with the list of formulas
# process with the function and then reduce them with left_join
reduce(map(.x = list_formula, .f = operation_function, df = df),
left_join)
#> Joining, by = c("id", "varA.t1", "varA.t2", "varB.t1", "varB.t2")
#> # A tibble: 2 x 7
#> id varA.t1 varA.t2 varB.t1 varB.t2 varC.t1 varC.t2
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 row_1 5 10 2 4 2.5 2.5
#> 2 row_2 20 50 4 6 5 8.33
Created on 2021-04-10 by the reprex package (v1.0.0)
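Regarding the note above about giving the new column its proper name directly, one possible sketch is to split each formula string at the "=" sign and inject the two halves separately. The helper add_from_formula is hypothetical, and the split assumes the right-hand side contains no further "=" characters (so no "==" comparisons).

```r
library(dplyr)
library(rlang)

df <- tribble(
  ~id,     ~varA.t1, ~varA.t2, ~varB.t1, ~varB.t2,
  "row_1",        5,       10,        2,        4,
  "row_2",       20,       50,        4,        6
)

# Hypothetical helper: "name = expression" -> mutate(name = expression)
add_from_formula <- function(df, formula) {
  parts    <- strsplit(formula, "=", fixed = TRUE)[[1]]
  new_name <- trimws(parts[1])
  rhs      <- parse_expr(trimws(parts[2]))
  mutate(df, !!new_name := !!rhs)
}

# One formula string per time point
formulas <- vapply(
  1:2,
  function(i) gsub("#", i, "varC.t# = varA.t# / varB.t#", fixed = TRUE),
  character(1)
)

# Apply the formulas one after another to the same data frame
res <- Reduce(add_from_formula, formulas, df)
res
```

Folding the formulas over the data frame with Reduce() also removes the need for the left_join step.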
Why doesn't dplyr like this format of 'beta linalool' in my function as compared to beta.linalool?
It took me a few hours of troubleshooting to figure out what the problem was. Is there any way to use data where variables are labeled as more than one word or should I just move everything to the beta.linalool type format?
Everything I have learned has been from Programming with dplyr.
library(ggplot2)
library(readxl)
library(dplyr)
library(magrittr)
Data3<- read_excel("Desktop/Data3.xlsx")
Data3 %>% filter(Variety=="CS 420A"&`Red Blotch`=="-")%>% group_by(`Time Point`)%>%
summarise(m=mean(`beta linalool`),SD=sd(`beta linalool`))
# A tibble: 4 x 3
`Time Point` m SD
<chr> <dbl> <dbl>
1 End 0.00300 0.000117
2 Mid 0.00385 0.000353
3 Must 0.000254 0.00000633
4 Start 0.000785 0.000283
Now when I work it into a function:
cwine<-function(df,v,rb,c){
c<-enquo(c)
df %>% filter(Variety==v&`Red Blotch`==rb)%>%
group_by(`Time Point`) %>%
summarise_(m=mean(!!c),SD=sd(!!c))
}
cwine(Data3,"CS 420A","-",'beta linalool')
# A tibble: 4 x 3
`Time Point` m SD
<chr> <dbl> <dbl>
1 End NA NA
2 Mid NA NA
3 Must NA NA
4 Start NA NA
Warning messages:
1: In mean.default(~"beta linalool") :
argument is not numeric or logical: returning NA #this statement is repeated 4 more times
5: In var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm = na.rm) :
NAs introduced by coercion #this statement is repeated 4 more times
The problem lies in beta linalool being typed in as the string 'beta linalool'. I figured this out by trying the same approach on the iris dataset, where a column such as Petal.Length can be passed as a bare name and needs no quotes:
my_function<-function(ds,x,y,c){
c<-enquo(c)
ds %>%filter(Sepal.Length>x&Sepal.Width<y) %>%
group_by(Species) %>%
summarise(m=mean(!!c),SD=sd(!!c))
}
my_function(iris,5,4,Petal.Length)
# A tibble: 3 x 3
Species m SD
<fct> <dbl> <dbl>
1 setosa 1.53 0.157
2 versicolor 4.32 0.423
3 virginica 5.57 0.536
In fact my function works fine on a different variable:
> cwine(Data2,"CS 420A","-",nerol)
# A tibble: 4 x 3
`Time Point` m SD
<chr> <dbl> <dbl>
1 End 0.000453 0.0000338
2 Mid 0.000659 0.0000660
3 Must 0.000560 0.0000234
4 Start 0.000927 0.0000224
Is dplyr just that sensitive or am I missing something?
One option would be to convert it to a symbol and evaluate it:
library(tidyverse)
cwine <- function(df,v,rb,c){
df %>%
filter(Variety==v & `Red Blotch` == rb)%>%
group_by(`Time Point`) %>%
summarise(m = mean(!!rlang::sym(c)),
SD = sd(!! rlang::sym(c)))
}
cwine(Data3,"CS 420A","-",'beta linalool')
# A tibble: 2 x 3
# `Time Point` m SD
# <int> <dbl> <dbl>
#1 2 -2.11 2.23
#2 4 0.0171 NA
Also, if we want to pass it as a quosure (enquo), it works when we pass the variable name in backquotes. Usually the unquoted name works, but here there is a space between the words, so backquotes are needed for it to be evaluated as is.
cwine <- function(df,v,rb,c){
c1 <- enquo(c)
df %>%
filter(Variety==v & `Red Blotch` == rb)%>%
group_by(`Time Point`) %>%
summarise(m = mean(!! c1 ),
SD = sd(!! c1))
}
cwine(Data3,"CS 420A","-",`beta linalool`)
# A tibble: 2 x 3
# `Time Point` m SD
# <int> <dbl> <dbl>
#1 2 -2.11 2.23
#2 4 0.0171 NA
data
set.seed(24)
Data3 <- tibble(Variety = sample(c("CS 420A", "CS 410A"), 20, replace = TRUE),
`Red Blotch` = sample(c("-", "+"), 20, replace = TRUE),
`Time Point` = sample(1:4, 20, replace = TRUE),
`beta linalool` = rnorm(20))
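A third variant, sketched here as an assumption rather than part of the answer above, keeps the column name as a plain string and looks it up through the .data pronoun. This handles names containing spaces without any backquotes at the call site; the argument name col is hypothetical.

```r
library(dplyr)

set.seed(24)
Data3 <- tibble(Variety = sample(c("CS 420A", "CS 410A"), 20, replace = TRUE),
                `Red Blotch` = sample(c("-", "+"), 20, replace = TRUE),
                `Time Point` = sample(1:4, 20, replace = TRUE),
                `beta linalool` = rnorm(20))

cwine <- function(df, v, rb, col) {
  df %>%
    filter(Variety == v, `Red Blotch` == rb) %>%
    group_by(`Time Point`) %>%
    # .data[[col]] looks the column up by its name given as a string
    summarise(m  = mean(.data[[col]]),
              SD = sd(.data[[col]]))
}

res <- cwine(Data3, "CS 420A", "-", "beta linalool")
res
```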