R - dplyr. Functions with variable similar to dataframe columns


I am filtering a dataframe inside a function, but the dataframe has a column with the same name as the function argument I want to filter on.
Example:
library(dplyr)
d = tibble(cond = c(1,2), b = c(1,2))
f_ = function(data, cond) {
  data = data %>% filter(b == cond)
  return(data)
}
f_(d, cond = 2)
# A tibble: 2 x 2
cond b
<dbl> <dbl>
1 1 1
2 2 2
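No filtering happens, because inside filter() the name cond resolves to the column cond, which equals b in every row, rather than to the function argument.
This becomes an issue when I do not control which columns the data contains; all I know is that it has at least the b column.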

We can change the function so that cond is evaluated in the function environment first (with !!) instead of being looked up as a column of the data:
f_ = function(data, cond) {
  data %>%
    filter(b == !!cond)
}
f_(d, cond = 2)
# A tibble: 1 x 2
# cond b
# <dbl> <dbl>
#1 2 2
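An alternative to !! is to state explicitly where each name should be looked up, using the .data and .env pronouns available inside dplyr's data-masking verbs. The following is a minimal sketch under that assumption; f_env is a hypothetical name used only for illustration.

```r
library(dplyr)

d <- tibble(cond = c(1, 2), b = c(1, 2))

# .data$b  : the column b of the data frame
# .env$cond: the function argument cond, never the column of the same name
f_env <- function(data, cond) {
  data %>%
    filter(.data$b == .env$cond)
}

res <- f_env(d, cond = 2)
print(res)
```

This keeps the function robust even when the incoming data happens to contain a column called cond.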

Related

Create several new variables using a vector of names and a vector for computation within dplyr::mutate

I'd like to create several new columns. They should take their names from one vector, and each should be computed by dividing one column of the data by another.
mytib <- tibble(id = 1:2, value1 = c(4,6), value2 = c(42, 5), total = c(2,2))
myvalues <- c("value1", "value2")
mynames <- c("value1_percent", "value2_percent")
mytib %>%
mutate({{ mynames }} := {{ myvalues }}/total)
This fails with the following error, which makes me think I am misusing the curly-curly operator:
Error in local_error_context(dots = dots, .index = i, mask = mask) : promise already under evaluation: recursive default argument reference or earlier problems?
I'd like to calculate the percentage columns programmatically, since I have many such columns in my data.
The desired output should be equivalent to this:
mytib %>%
mutate( "value1_percent" = value1/total, "value2_percent" = value2/total)
which gives
# A tibble: 2 × 6
id value1 value2 total value1_percent value2_percent
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 4 42 2 2 21
2 2 6 5 2 3 2.5

You could use across and construct the new names in its .names argument:
library(dplyr)
mytib %>%
mutate(across(starts_with('value'),
~ .x / total,
.names = "{.col}_percent"
))
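If the columns to transform are given as a character vector (as with myvalues in the question) rather than sharing a common prefix, the same idea can be sketched with all_of(). This is only an illustration and assumes the new names can be derived from the old ones by appending "_percent".

```r
library(dplyr)

mytib <- tibble(id = 1:2, value1 = c(4, 6), value2 = c(42, 5), total = c(2, 2))
myvalues <- c("value1", "value2")

# all_of() selects exactly the columns named in the character vector;
# .names builds each new column name from the original one.
res <- mytib %>%
  mutate(across(all_of(myvalues),
                ~ .x / total,
                .names = "{.col}_percent"))

print(res)
```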

I prefer mutate(across(...)) in this case. To make your idea work, try reduce2() from purrr.
library(dplyr)
library(purrr)
reduce2(mynames, myvalues,
~ mutate(..1, !!..2 := !!sym(..3)/total), .init = mytib)
# # A tibble: 2 x 6
# id value1 value2 total value1_percent value2_percent
# <int> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 4 42 2 2 21
# 2 2 6 5 2 3 2.5
The above code is shorthand for:
mytib %>%
mutate(!!mynames[1] := !!sym(myvalues[1])/total,
!!mynames[2] := !!sym(myvalues[2])/total)

grouped summarize still gives result for each individual row

I have the following data:
library(tidyverse)
df <- data.frame(id = c(1,1,1,2,2,2),
x = rep(letters[1:2], each = 3),
y = c(3,4,3,5,6,5),
z = c(7,8,9,10,11,12))
I now want to summarize the data by id so that I get the sum of z for certain y values. Which y values count depends on the value of x.
I thought I could use the code below, but it returns one row per input row instead of one row per id. The values are correct, but I want a single row per id.
df %>%
group_by(id) %>%
summarize(test = case_when(x == 'a' ~ sum(z[y == 3]),
x == 'b' ~ sum(z[y == 5])))
# A tibble: 6 x 2
# Groups: id [2]
id test
<dbl> <dbl>
1 1 16
2 1 16
3 1 16
4 2 22
5 2 22
6 2 22
The following works, but I don't understand why it does and the code above does not.
df %>%
group_by(id) %>%
summarize(test = case_when(all(x == 'a') ~ sum(z[y == 3]),
all(x == 'b') ~ sum(z[y == 5])))
# A tibble: 2 x 2
id test
<dbl> <dbl>
1 1 16
2 2 22
Also, is there a more straightforward way to do my summarization?

Because case_when(), like ifelse(test, x, y), returns a vector of the same length as its test. x == 'a' has one element per row in the group, so you get one result per row. all(x == 'a') has length 1, so the returned value has length 1.
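As for a more direct way to write the summary, one possible sketch is to fold both conditions into a single logical index, so that sum() always returns exactly one value per group. It uses the df from the question and is only one of several reasonable formulations.

```r
library(dplyr)

df <- data.frame(id = c(1, 1, 1, 2, 2, 2),
                 x  = rep(letters[1:2], each = 3),
                 y  = c(3, 4, 3, 5, 6, 5),
                 z  = c(7, 8, 9, 10, 11, 12))

# Keep rows where (x is 'a' and y is 3) or (x is 'b' and y is 5),
# then sum z over those rows within each id.
res <- df %>%
  group_by(id) %>%
  summarize(test = sum(z[(x == "a" & y == 3) | (x == "b" & y == 5)]))

print(res)
```

Because sum() collapses the selected values to a single number, the result has one row per id regardless of how many rows each group contains.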

summarize_all rows by grouping and define which value should be kept

I have a data frame that was built by merging several data sources, so some rows share the same id. Now I want to define which values from which row should be kept.
So far I have been using dplyr with group_by and summarise_all to keep the first value that is not NA.
Here's an example:
# function f for summarizing
f <- function(x) {
x <- na.omit(x)
if (length(x) > 0) first(x) else NA
}
# test data
test <- data.frame(id = c(1,2,1,2), value1 = c("a",NA,"b","c"), value2 = c(0:3))
id value1 value2
1 a 0
2 <NA> 1
1 b 2
2 c 3
Summarizing by id gives the following result:
test <- test %>% group_by(id) %>% summarise_all(funs(f))
id value1 value2
1 a 0
2 c 1
Now the question: skipping NA values (via na.omit) already works, but how can I also skip the numeric value 0, so that the first non-zero value is kept instead? The expected result looks like this:
id value1 value2
1 a 2
2 c 1

You can modify your f function so that it also drops the elements equal to zero:
f <- function(x) {
x <- na.omit(x)
x <- x[x != 0]
if (length(x) > 0) first(x) else NA
}
Sidenote: as of dplyr 0.8.0, funs is deprecated. You should use a lambda, a list of functions, or a list of lambdas instead. In this case I used a single lambda:
test %>%
group_by(id) %>%
summarise_all(~f(.))
# A tibble: 2 x 3
id value1 value2
<dbl> <chr> <int>
1 1 a 2
2 2 c 1

You can write the f function as:
library(dplyr)
f <- function(x) x[!is.na(x) & x != 0][1]
test %>% group_by(id) %>% summarise(across(.fns = f))
# id value1 value2
# <dbl> <chr> <int>
#1 1 a 2
#2 2 c 1
Using [1] returns NA automatically if the vector contains no value that is both non-NA and non-zero.
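The behaviour of this one-line f on a few small vectors can be checked directly in base R. The inputs below are made up purely for illustration.

```r
# First element that is neither NA nor zero; NA if there is none.
f <- function(x) x[!is.na(x) & x != 0][1]

only_missing <- f(c(0, NA, 0))   # nothing qualifies, so indexing [1] yields NA
first_number <- f(c(0, 5, 3))    # 0 is skipped, 5 is the first match
first_string <- f(c(NA, "b"))    # works for character vectors too

print(only_missing)
print(first_number)
print(first_string)
```

Indexing an empty vector with [1] yields an NA of the same type, which is why no explicit length check is needed.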

As a sidenote to @RicS's sidenote: as of dplyr 1.0.0, summarise_all() is superseded. You should rather use across():
test %>%
  group_by(id) %>%
  summarise(across(everything(), f))

how to use tidyeval functions with loops?

Consider this simple example
library(dplyr)
dataframe <- tibble(id = c(1,2,3,4),
                    group = c('a','b','c','c'),
                    value = c(200,400,120,300))
> dataframe
# A tibble: 4 x 3
id group value
<dbl> <chr> <dbl>
1 1 a 200
2 2 b 400
3 3 c 120
4 4 c 300
and this tidyeval function that uses dplyr to aggregate my dataframe according to some input column.
func_tidy <- function(data, mygroup){
quo_var <- enquo(mygroup)
df_agg <- data %>%
group_by(!!quo_var) %>%
summarize(mean = mean(value, na.rm = TRUE),
count = n()) %>%
ungroup()
df_agg
}
now, this works
> func_tidy(dataframe, group)
# A tibble: 3 x 3
group mean count
<chr> <dbl> <int>
1 a 200 1
2 b 400 1
3 c 210 2
but doing the same thing from within a loop fails:
for(col in c(group)){
func_tidy(dataframe, col)
}
Error in grouped_df_impl(data, unname(vars), drop) : Column `col` is unknown
What is the problem here? How can I use my tidyeval function in a loop?
Thanks!

For looping through column names you will need to use character strings.
for(col in "group")
When you pass this variable to your function, you will need to convert it from a character string to a symbol using rlang::sym. You use !! to unquote so the expression is evaluated.
So your loop would look like (I add a print to see the output):
for(col in "group"){
print( func_tidy(dataframe, !! rlang::sym(col) ) )
}
# A tibble: 3 x 3
group mean count
<chr> <dbl> <int>
1 a 200 1
2 b 400 1
3 c 210 2
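If the function is mostly going to be called from loops, another option is to let it accept the column name as a string in the first place. The sketch below assumes dplyr 1.0.0 or later (for across() and the .groups argument); func_str and results are hypothetical names used only here.

```r
library(dplyr)

dataframe <- tibble(id = c(1, 2, 3, 4),
                    group = c("a", "b", "c", "c"),
                    value = c(200, 400, 120, 300))

# group_col is a character string naming the grouping column.
func_str <- function(data, group_col) {
  data %>%
    group_by(across(all_of(group_col))) %>%
    summarize(mean = mean(value, na.rm = TRUE),
              count = n(),
              .groups = "drop")
}

# Collect one aggregated table per grouping column.
results <- list()
for (col in c("group", "id")) {
  results[[col]] <- func_str(dataframe, col)
}

print(results[["group"]])
```

With this design the loop variable is passed as-is, so no sym() or !! is needed at the call site.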

how to write a function that uses broom, dplyr and lm?

Consider this very simple example
library(dplyr)
library(broom)
dataframe <- tibble(id = c(1,2,3,4,5,6),
                    group = c(1,1,1,2,2,2),
                    value = c(200,400,120,300,100,100))
# A tibble: 6 x 3
id group value
<dbl> <dbl> <dbl>
1 1 1 200
2 2 1 400
3 3 1 120
4 4 2 300
5 5 2 100
6 6 2 100
Here I want to write a function that outputs the upper bound of the confidence interval for the mean of value. That is,
get_ci_high <- function(data, myvar){
confint_tidy(lm(data = data, myvar ~ 1)) %>% pull(conf.high)
}
Now, this works easily
confint_tidy(lm(data = dataframe, value ~ 1)) %>% pull(conf.high)
[1] 332.9999
This works as well (note the call after a group_by)
dataframe %>% group_by(group) %>% mutate(dealwithit = get_ci_high(., value))
# A tibble: 6 x 4
# Groups: group [2]
id group value dealwithit
<dbl> <dbl> <dbl> <dbl>
1 1 1 200 598.2674
2 2 1 400 598.2674
3 3 1 120 598.2674
4 4 2 300 453.5102
5 5 2 100 453.5102
6 6 2 100 453.5102
This works wonderfully
mindblow <- function(data, groupvar, outputvar){
quo_groupvar <- enquo(groupvar)
quo_outputvar <- enquo(outputvar)
data %>% group_by(!!quo_groupvar) %>%
summarize(output = get_ci_high(., !!quo_outputvar))%>%
ungroup()
}
> mindblow(dataframe, groupvar = group, outputvar = value)
# A tibble: 2 x 2
group output
<dbl> <dbl>
1 1 598.2674
2 2 453.5102
... but this FAILS
get_ci_high(dataframe, value)
Error in eval(expr, envir, enclos) : object 'value' not found
I don't get what is wrong here. I really need a solution that works in the four cases above.
Any ideas?
Many thanks!!

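The reason is that when you pass value as the argument, you want R to use its name, "value", in the formula. Instead, lm() finds no column called myvar in the data, falls back to the function argument myvar, and tries to evaluate value as an ordinary variable, which doesn't exist.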
One solution would be to extract the name using substitute() (non-standard evaluation), and create a formula using as.formula:
get_ci_high <- function(data, myvar) {
col_name <- as.character(substitute(myvar))
fmla <- as.formula(paste(col_name, "~ 1"))
confint_tidy(lm(data = data, fmla)) %>% pull(conf.high)
}
get_ci_high(dataframe, value)
However, I'd strongly recommend passing the formula value ~ 1 as the second argument instead. This is both simpler and more flexible for performing other linear models (when you have predictors as well).
get_ci_high <- function(data, fmla) {
confint_tidy(lm(data = data, fmla)) %>% pull(conf.high)
}
get_ci_high(dataframe, value ~ 1)
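The formula-based version can also be sketched with base R's confint() in place of confint_tidy(), which may be unavailable in recent broom releases. This is an illustrative variant under that assumption, not the original answer's code; the level argument and the per-group call via split() are additions made here.

```r
dataframe <- data.frame(id = 1:6,
                        group = c(1, 1, 1, 2, 2, 2),
                        value = c(200, 400, 120, 300, 100, 100))

# Upper confidence bound(s) for the coefficients of a linear model.
get_ci_high <- function(data, fmla, level = 0.95) {
  fit <- lm(fmla, data = data)
  # confint() returns a matrix with one row per coefficient;
  # column 2 holds the upper bound.
  unname(confint(fit, level = level)[, 2])
}

# Whole data set
ci_all <- get_ci_high(dataframe, value ~ 1)

# One bound per group, using split() from base R
ci_by_group <- sapply(split(dataframe, dataframe$group),
                      get_ci_high, fmla = value ~ 1)

print(ci_all)
print(ci_by_group)
```

Passing a formula keeps the function free of non-standard evaluation, so it behaves the same whether it is called directly or once per group.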
