Consider this simple example:
library(dplyr)
# tibble() replaces the deprecated data_frame()
dataframe <- tibble(id = c(1,2,3,4),
                    group = c('a','b','c','c'),
                    value = c(200,400,120,300))
> dataframe
# A tibble: 4 x 3
id group value
<dbl> <chr> <dbl>
1 1 a 200
2 2 b 400
3 3 c 120
4 4 c 300
and this tidyeval function, which uses dplyr to aggregate my data frame by a given input column:
func_tidy <- function(data, mygroup){
  quo_var <- enquo(mygroup)
  df_agg <- data %>%
    group_by(!!quo_var) %>%
    summarize(mean = mean(value, na.rm = TRUE),
              count = n()) %>%
    ungroup()
  df_agg
}
Now, this works:
> func_tidy(dataframe, group)
# A tibble: 3 x 3
group mean count
<chr> <dbl> <int>
1 a 200 1
2 b 400 1
3 c 210 2
But doing the same thing from within a loop FAILS:
for(col in c(group)){
func_tidy(dataframe, col)
}
Error in grouped_df_impl(data, unname(vars), drop) : Column `col` is unknown
What is the problem here? How can I use my tidyeval function in a loop?
Thanks!
To loop through column names, you need to use character strings:
for(col in "group")
When you pass this variable to your function, convert it from a character string to a symbol with rlang::sym, then unquote it with !! so that the function receives the symbol itself rather than the expression rlang::sym(col).
So your loop would look like this (I added a print to see the output):
for(col in "group"){
print( func_tidy(dataframe, !! rlang::sym(col) ) )
}
# A tibble: 3 x 3
group mean count
<chr> <dbl> <int>
1 a 200 1
2 b 400 1
3 c 210 2
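To run the function over several columns and keep the results, one option is to iterate over a character vector and collect the tibbles in a named list. This is only a minimal sketch of that idea: the names `cols` and `results` are illustrative and not part of the original question, and `tibble()` stands in for `data_frame()`.

```r
library(dplyr)

dataframe <- tibble(id = c(1, 2, 3, 4),
                    group = c("a", "b", "c", "c"),
                    value = c(200, 400, 120, 300))

# Same function as in the question
func_tidy <- function(data, mygroup) {
  quo_var <- enquo(mygroup)
  data %>%
    group_by(!!quo_var) %>%
    summarize(mean = mean(value, na.rm = TRUE),
              count = n()) %>%
    ungroup()
}

# Column names as strings (illustrative choice of columns)
cols <- c("group", "id")

# Convert each string to a symbol and unquote it at the call site
results <- lapply(cols, function(col) func_tidy(dataframe, !!rlang::sym(col)))
names(results) <- cols

results[["group"]]
```

Storing the output in a list avoids relying on `print()` inside the loop and keeps each result accessible by column name.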
I'd like to create several new columns. They should take their names from one vector and they should be computed by taking one column in the data and dividing it by another.
mytib <- tibble(id = 1:2, value1 = c(4,6), value2 = c(42, 5), total = c(2,2))
myvalues <- c("value1", "value2")
mynames <- c("value1_percent", "value2_percent")
mytib %>%
mutate({{ mynames }} := {{ myvalues }}/total)
Here I get the following error message, which makes me think that the curly-curly operator is misplaced:
Error in local_error_context(dots = dots, .index = i, mask = mask) : promise already under evaluation: recursive default argument reference or earlier problems?
I'd like to calculate the percentage columns programmatically (since I have many such columns in my data).
The desired output should be equivalent to this:
mytib %>%
mutate( "value1_percent" = value1/total, "value2_percent" = value2/total)
which gives
# A tibble: 2 × 6
id value1 value2 total value1_percent value2_percent
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 4 42 2 2 21
2 2 6 5 2 3 2.5
You could use across and construct the new names in its .names argument:
library(dplyr)
mytib %>%
mutate(across(starts_with('value'),
~ .x / total,
.names = "{.col}_percent"
))
I prefer mutate(across(...)) in this case. To make your idea work, try reduce2() from purrr.
library(dplyr)
library(purrr)
reduce2(mynames, myvalues,
~ mutate(..1, !!..2 := !!sym(..3)/total), .init = mytib)
# # A tibble: 2 x 6
# id value1 value2 total value1_percent value2_percent
# <int> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 4 42 2 2 21
# 2 2 6 5 2 3 2.5
The above code is actually a shortcut for:
mytib %>%
mutate(!!mynames[1] := !!sym(myvalues[1])/total,
!!mynames[2] := !!sym(myvalues[2])/total)
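The curly-curly operator is meant for forwarding function arguments, which is why it does not help with plain character vectors like `mynames` and `myvalues`. As a hedged alternative to `reduce2()`, the same columns can be built in an explicit loop using the `.data` pronoun to look up a column by its string name. The names `result`, `new_name`, and `old_name` are illustrative.

```r
library(dplyr)

mytib <- tibble(id = 1:2, value1 = c(4, 6), value2 = c(42, 5), total = c(2, 2))
myvalues <- c("value1", "value2")
mynames  <- c("value1_percent", "value2_percent")

result <- mytib
for (i in seq_along(myvalues)) {
  new_name <- mynames[i]   # name of the column to create
  old_name <- myvalues[i]  # name of the column to divide by total
  result <- result %>%
    mutate(!!new_name := .data[[old_name]] / total)
}

result
```

This adds one column per iteration, in the same order as the `reduce2()` version.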
I have a question about using dplyr inside a for loop to create multiple data frames. The code without the loop works well, but the code with the for loop produces an error instead of the expected result.
The error message was:
Error in UseMethod("select_") : no applicable method for 'select_' applied to an object of class "character"
Can anyone point me in the right direction?
The code below worked
B <- data %>% select (column1) %>% group_by (column1) %>% arrange (column1) %>% summarise (n = n ())
The code below did not work
column_list <- c ('column1', 'column2', 'column3')
for (b in column_list) {
a <- data %>% select (b) %>% group_by (b) %>% arrange (b) %>% summarise (n = n () )
assign (paste0(b), a)
}
Don't use assign; use lists instead.
We can use the _at variants in dplyr, which work with character variables.
library(dplyr)
split_fun <- function(df, col) {
df %>% group_by_at(col) %>% summarise(n = n()) %>% arrange_at(col)
}
and then use lapply/map to apply it to different columns:
purrr::map(column_list, ~split_fun(data, .))
This returns a list of data frames, which can be accessed individually with [[ if needed.
An example using mtcars:
df <- mtcars
column_list <- c ('cyl', 'gear', 'carb')
purrr::map(column_list, ~split_fun(df, .))
#[[1]]
# A tibble: 3 x 2
# cyl n
# <dbl> <int>
#1 4 11
#2 6 7
#3 8 14
#[[2]]
# A tibble: 3 x 2
# gear n
# <dbl> <int>
#1 3 15
#2 4 12
#3 5 5
#[[3]]
# A tibble: 6 x 2
# carb n
# <dbl> <int>
#1 1 7
#2 2 10
#3 3 3
#4 4 10
#5 6 1
#6 8 1
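The scoped `_at` verbs were superseded by `across()` in dplyr 1.0.0. A minimal sketch of the same idea in that style follows; `count_by` and `counts` are hypothetical names, and the sketch assumes a dplyr version that provides `across()` and `all_of()`.

```r
library(dplyr)

# Count rows per level of a column given by its name as a string
count_by <- function(df, col) {
  df %>%
    group_by(across(all_of(col))) %>%
    summarise(n = n(), .groups = "drop")
}

column_list <- c("cyl", "gear", "carb")

# One tibble per column, stored in a named list instead of using assign()
counts <- lapply(column_list, function(col) count_by(mtcars, col))
names(counts) <- column_list

counts[["cyl"]]
```

Naming the list elements makes each result reachable as `counts[["gear"]]` and so on, which covers what `assign()` was being used for.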
I'd like to drop rows from my dataset that are all NAs (AKA keep rows with any non-NAs) for a list of columns. How could I update this code so that x & y are supplied as a vector? This would enable me to flexibly add and drop columns for inspection.
library(dplyr)
ds <-
tibble(
id = c(1:4),
x = c(NA, 1, NA, 4),
y = c(NA, NA , 3, 4)
)
ds %>%
rowwise() %>%
filter(
any(
!is.na(x),
!is.na(y)
)
) %>%
ungroup()
I'm trying to write something like any(!is.na(c(x,y))) but I'm not sure how to supply multiple arguments to is.na().
We can use filter_at with any_vars:
ds %>%
filter_at(vars(x:y), any_vars(!is.na(.)))
# A tibble: 3 x 3
# id x y
# <int> <dbl> <dbl>
#1 2 1 NA
#2 3 NA 3
#3 4 4 4
Update (Feb 7, 2022):
In newer versions of dplyr (as @GitHunter0 suggested), we can use if_all/if_any or across:
ds %>%
filter(if_any(x:y, complete.cases))
# A tibble: 3 × 3
id x y
<int> <dbl> <dbl>
1 2 1 NA
2 3 NA 3
3 4 4 4
You can also use ds %>% filter(!if_all(x:y, is.na)).
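Since the question asks for the columns to be supplied as a vector, here is a minimal sketch that combines `if_any()` with `all_of()`. It assumes a dplyr version that includes `if_any()`; the names `cols` and `kept` are illustrative.

```r
library(dplyr)

ds <- tibble(
  id = 1:4,
  x = c(NA, 1, NA, 4),
  y = c(NA, NA, 3, 4)
)

# Columns to inspect, given as a character vector
cols <- c("x", "y")

# Keep rows where at least one of the listed columns is not NA
kept <- ds %>%
  filter(if_any(all_of(cols), ~ !is.na(.x)))

kept
```

Adding or dropping a column for inspection then only means editing `cols`.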
I recently had to compile a data frame of student scores (one row per student, id column and several integer-valued columns, one per score component). I had to combine a "master" data frame and several "correction" data frames (containing mostly NA and some updates to the master), so that the result contains the maximum values from the master, and all corrections.
I succeeded by copy-pasting a sequence of mutate() calls, which works (see example below) but is not elegant in my opinion. What I wanted instead was to use something along the lines of map2 with two lists of columns, comparing the columns pairwise. Something like this (which obviously does not work as written):
list_of_cols1 <- list(col1.x, col2.x, col3.x)
list_of_cols2 <- list(col1.y, col2.y, col3.y)
map2(list_of_cols1, list_of_cols2, ~ column = pmax(.x, .y, na.rm=T))
I can't figure out how to do it. My question is: how do I specify such lists of columns and mutate them in one map2() call in a dplyr pipe? Is it even possible, or have I gotten it all wrong?
Minimum working example
library(tidyverse)
master <- tibble(
id=c(1,2,3),
col1=c(1,1,1),
col2=c(2,2,2),
col3=c(3,3,3)
)
correction1 <- tibble(
id=seq(1,3),
col1=c(NA, NA, 2 ),
col2=c( 1, NA, 3 ),
col3=c(NA, NA, NA)
)
result <- reduce(
  # Ultimately there would be several correction data frames
  list(master, correction1),
  function(x,y) {
    x <- x %>%
      left_join(
        y,
        by = c("id")
      ) %>%
      # Wish I knew how to do this mutate call with map2
      mutate(
        col1 = pmax(col1.x, col1.y, na.rm=TRUE),
        col2 = pmax(col2.x, col2.y, na.rm=TRUE),
        col3 = pmax(col3.x, col3.y, na.rm=TRUE)
      ) %>%
      select(id, col1:col3)
  }
)
The result is
> result
# A tibble: 3 x 4
id col1 col2 col3
<int> <dbl> <dbl> <dbl>
1 1 1 2 3
2 2 1 2 3
3 3 2 3 3
Rather than doing a left_join, just bind the rows and then summarize. For example:
result <- reduce(
  # Combine the master with the correction table
  list(master, correction1),
  function(x,y) {
    bind_rows(x, y) %>%
      group_by(id) %>%
      summarize_all(max, na.rm=TRUE)
  }
)
result
# id col1 col2 col3
# <dbl> <dbl> <dbl> <dbl>
# 1 1 1 2 3
# 2 2 1 2 3
# 3 3 2 3 3
Actually, you don't even need reduce, since bind_rows can take a list (or several data frames at once).
Adding another table:
correction2 <- tibble(id=2,col1=NA,col2=8,col3=NA)
bind_rows(master, correction1, correction2) %>%
group_by(id) %>%
summarize_all(max, na.rm=T)
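The list form mentioned above can be sketched as follows, using `across()` in place of `summarize_all()`. This assumes a dplyr version that provides `across()`; `all_tables` is an illustrative name.

```r
library(dplyr)

master <- tibble(id = c(1, 2, 3), col1 = c(1, 1, 1),
                 col2 = c(2, 2, 2), col3 = c(3, 3, 3))
correction1 <- tibble(id = 1:3, col1 = c(NA, NA, 2),
                      col2 = c(1, NA, 3), col3 = c(NA, NA, NA))
correction2 <- tibble(id = 2, col1 = NA, col2 = 8, col3 = NA)

# Any number of correction tables can be appended to this list
all_tables <- list(master, correction1, correction2)

result <- bind_rows(all_tables) %>%
  group_by(id) %>%
  summarise(across(everything(), ~ max(.x, na.rm = TRUE)))

result
```

Because the master supplies a non-missing value for every id and column, `max(..., na.rm = TRUE)` never sees an all-NA group here.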
Sorry, this doesn't answer your question about map2, but I find it easier to aggregate over rows than over columns in tidy R:
library(dplyr)
master <- tibble(
id=c(1,2,3),
col1=c(1,1,1),
col2=c(2,2,2),
col3=c(3,3,3)
)
correction1 <- tibble(
id=seq(1,3),
col1=c(NA, NA, 2 ),
col2=c( 1, NA, 3 ),
col3=c(NA, NA, NA)
)
result <- list(master, correction1) %>%
bind_rows() %>%
group_by(id) %>%
summarise_all(max, na.rm = TRUE)
result
#> # A tibble: 3 x 4
#> id col1 col2 col3
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 2 3
#> 2 2 1 2 3
#> 3 3 2 3 3
If correction tables will always have the same structure as master, you can do something like the following:
library(dplyr)
library(purrr)
update_master = function(...){
map(list(...), as.matrix) %>%
reduce(pmax, na.rm = TRUE) %>%
data.frame()
}
update_master(master, correction1)
To allow id to take character values, make the following modification:
update_master = function(x, ...){
map(list(x, ...), function(x) as.matrix(x[-1])) %>%
reduce(pmax, na.rm = TRUE) %>%
data.frame(id = x[[1]], .)
}
update_master(master, correction1)
Result:
id col1 col2 col3
1 1 1 2 3
2 2 1 2 3
3 3 2 3 3
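For the map2 approach the question originally asked about, one possible sketch is to pass pairs of column names (as strings) to `map2()` and look the columns up in the joined table. This is only one way to do it, under the assumption that left_join's default ".x"/".y" suffixes are in use; `cols`, `joined`, and `merged` are illustrative names.

```r
library(dplyr)
library(purrr)

master <- tibble(id = c(1, 2, 3), col1 = c(1, 1, 1),
                 col2 = c(2, 2, 2), col3 = c(3, 3, 3))
correction1 <- tibble(id = 1:3, col1 = c(NA, NA, 2),
                      col2 = c(1, NA, 3), col3 = c(NA, NA, NA))

cols <- c("col1", "col2", "col3")
joined <- left_join(master, correction1, by = "id")

# Pairwise maximum of each "<name>.x" / "<name>.y" column pair
merged <- map2(paste0(cols, ".x"), paste0(cols, ".y"),
               ~ pmax(joined[[.x]], joined[[.y]], na.rm = TRUE))
names(merged) <- cols

result <- bind_cols(select(joined, id), as_tibble(merged))
result
```

To handle several correction tables, this body could replace the mutate() call inside the reduce() function from the question.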
Consider this very simple example
library(dplyr)
library(broom)
# tibble() replaces the deprecated data_frame()
dataframe <- tibble(id = c(1,2,3,4,5,6),
                    group = c(1,1,1,2,2,2),
                    value = c(200,400,120,300,100,100))
# A tibble: 6 x 3
id group value
<dbl> <dbl> <dbl>
1 1 1 200
2 2 1 400
3 3 1 120
4 4 2 300
5 5 2 100
6 6 2 100
Here I want to write a function that outputs the upper bound of the confidence interval for the mean of value. That is,
get_ci_high <- function(data, myvar){
confint_tidy(lm(data = data, myvar ~ 1)) %>% pull(conf.high)
}
Now, this works easily
confint_tidy(lm(data = dataframe, value ~ 1)) %>% pull(conf.high)
[1] 332.9999
This works as well (note the call after a group_by)
dataframe %>% group_by(group) %>% mutate(dealwithit = get_ci_high(., value))
# A tibble: 6 x 4
# Groups: group [2]
id group value dealwithit
<dbl> <dbl> <dbl> <dbl>
1 1 1 200 598.2674
2 2 1 400 598.2674
3 3 1 120 598.2674
4 4 2 300 453.5102
5 5 2 100 453.5102
6 6 2 100 453.5102
This works wonderfully
mindblow <- function(data, groupvar, outputvar){
quo_groupvar <- enquo(groupvar)
quo_outputvar <- enquo(outputvar)
data %>% group_by(!!quo_groupvar) %>%
summarize(output = get_ci_high(., !!quo_outputvar))%>%
ungroup()
}
> mindblow(dataframe, groupvar = group, outputvar = value)
# A tibble: 2 x 2
group output
<dbl> <dbl>
1 1 598.2674
2 2 453.5102
... but this FAILS
get_ci_high(dataframe, value)
Error in eval(expr, envir, enclos) : object 'value' not found
I don't get what is wrong here. I really need a solution that works in the four cases above.
Any ideas?
Many thanks!!
The reason is that when you pass value as an argument, you want R to use its name, "value", in the formula rather than the value of a variable called value (which doesn't exist outside the data frame).
One solution would be to extract the name using substitute() (non-standard evaluation), and create a formula using as.formula:
get_ci_high <- function(data, myvar) {
col_name <- as.character(substitute(myvar))
fmla <- as.formula(paste(col_name, "~ 1"))
confint_tidy(lm(data = data, fmla)) %>% pull(conf.high)
}
get_ci_high(dataframe, value)
However, I'd strongly recommend passing the formula value ~ 1 as the second argument instead. This is both simpler and more flexible for performing other linear models (when you have predictors as well).
get_ci_high <- function(data, fmla) {
confint_tidy(lm(data = data, fmla)) %>% pull(conf.high)
}
get_ci_high(dataframe, value ~ 1)
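For readers who prefer to stay within tidy evaluation, the column name can also be captured with `enquo()` and turned into a string before building the formula. The sketch below is an illustration rather than the original answer's method: it assumes base R's `confint()` is an acceptable stand-in for broom's `confint_tidy()`, and it covers only the direct call, not the grouped cases.

```r
library(dplyr)

dataframe <- tibble(id = c(1, 2, 3, 4, 5, 6),
                    group = c(1, 1, 1, 2, 2, 2),
                    value = c(200, 400, 120, 300, 100, 100))

get_ci_high <- function(data, myvar) {
  # Capture the bare column name and convert it to a string
  col_name <- rlang::as_name(rlang::enquo(myvar))
  # Build an intercept-only formula such as value ~ 1
  fmla <- as.formula(paste(col_name, "~ 1"))
  fit <- lm(fmla, data = data)
  # Row 1 is the intercept; column 2 is the upper 95% bound
  confint(fit)[1, 2]
}

ci_high <- get_ci_high(dataframe, value)
ci_high
```

On the example data this should agree with the 332.9999 reported in the question for the ungrouped call.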