how to write a function that uses broom, dplyr and lm?


Consider this very simple example
library(dplyr)
library(broom)
dataframe <- tibble(id = c(1,2,3,4,5,6),
group = c(1,1,1,2,2,2),
value = c(200,400,120,300,100,100))
# A tibble: 6 x 3
id group value
<dbl> <dbl> <dbl>
1 1 1 200
2 2 1 400
3 3 1 120
4 4 2 300
5 5 2 100
6 6 2 100
Here I want to write a function that outputs the upper bound of the confidence interval for the mean of value. That is,
get_ci_high <- function(data, myvar){
confint_tidy(lm(data = data, myvar ~ 1)) %>% pull(conf.high)
}
Now, this works easily
confint_tidy(lm(data = dataframe, value ~ 1)) %>% pull(conf.high)
[1] 332.9999
This works as well (note the call after a group_by)
dataframe %>% group_by(group) %>% mutate(dealwithit = get_ci_high(., value))
# A tibble: 6 x 4
# Groups: group [2]
id group value dealwithit
<dbl> <dbl> <dbl> <dbl>
1 1 1 200 598.2674
2 2 1 400 598.2674
3 3 1 120 598.2674
4 4 2 300 453.5102
5 5 2 100 453.5102
6 6 2 100 453.5102
This works wonderfully
mindblow <- function(data, groupvar, outputvar){
quo_groupvar <- enquo(groupvar)
quo_outputvar <- enquo(outputvar)
data %>% group_by(!!quo_groupvar) %>%
summarize(output = get_ci_high(., !!quo_outputvar))%>%
ungroup()
}
> mindblow(dataframe, groupvar = group, outputvar = value)
# A tibble: 2 x 2
group output
<dbl> <dbl>
1 1 598.2674
2 2 453.5102
... but this FAILS
get_ci_high(dataframe, value)
Error in eval(expr, envir, enclos) : object 'value' not found
I don't understand what is wrong here. I really need a solution that works in all four cases above.
Any ideas?
Many thanks!!

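The call fails because lm() first looks for myvar in data, does not find it, and then evaluates the argument in the calling environment, where no object named value exists. What you want is for R to use the name "value" in the formula, rather than the value of a variable called value.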
One solution would be to extract the name using substitute() (non-standard evaluation), and create a formula using as.formula:
get_ci_high <- function(data, myvar) {
col_name <- as.character(substitute(myvar))
fmla <- as.formula(paste(col_name, "~ 1"))
confint_tidy(lm(data = data, fmla)) %>% pull(conf.high)
}
get_ci_high(dataframe, value)
However, I'd strongly recommend passing the formula value ~ 1 as the second argument instead. This is both simpler and more flexible for performing other linear models (when you have predictors as well).
get_ci_high <- function(data, fmla) {
confint_tidy(lm(data = data, fmla)) %>% pull(conf.high)
}
get_ci_high(dataframe, value ~ 1)
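As a minimal sketch of the same idea without confint_tidy() (which is deprecated in newer versions of broom), the upper bound can also be read from base R's confint(). The helper name get_ci_high_chr and its string argument col_name are assumptions made for illustration, not part of the original answer:

```r
# Sample data from the question
dataframe <- data.frame(id = 1:6,
                        group = c(1, 1, 1, 2, 2, 2),
                        value = c(200, 400, 120, 300, 100, 100))

# Hypothetical helper: takes the column name as a character string
get_ci_high_chr <- function(data, col_name, level = 0.95) {
  fmla <- reformulate("1", response = col_name)  # builds e.g. value ~ 1
  fit <- lm(fmla, data = data)
  confint(fit, level = level)[1, 2]              # upper bound for the intercept
}

# Whole data set; should agree with the 332.9999 shown in the question
ci_all <- get_ci_high_chr(dataframe, "value")

# One model per group, fitted on each group's rows only
ci_by_group <- sapply(split(dataframe, dataframe$group),
                      get_ci_high_chr, col_name = "value")
```

Because the column is passed as a string, the same helper can be called directly, inside a loop, or on each piece of a split data frame without any quoting tricks.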

Related

dplyr mutate and purrr map: use data masking to select columns for map

In a dplyr mutate context, I would like to use the value of another column to select the column that purrr::map applies a function to.
Let's take a test data frame
test <- data.frame(a = c(1,2), b = c(3,4), selector = c("a","b"))
I want to apply following function
calc <- function(col) {
res <- col ^ 2
return(res)
}
I am trying something like this:
test_2 <- test %>% mutate(quad = map(.data[[selector]], ~ calc(.x)))
My expected result would be:
a b selector quad
1 1 3 a 1
2 2 4 b 16
but I get
Error in local_error_context(dots = dots, .index = i, mask = mask) :
promise already under evaluation: recursive default argument reference or earlier problems?
I know .data[[var]] is supposed to be used only in the special context of programming with functions, but even when I wrap this in a function I cannot get it to work. Trying tidy-selection instead gives an error that selection helpers can only be used in special dplyr verbs, not in functions like purrr::map.
The question "how to use dynamic variable in purrr map within dplyr" pointed me to get() and anonymous functions, but that did not work in this context either.

Here's one way:
test %>%
mutate(quad = map(seq_along(selector), ~ calc(test[[selector[.x]]])[.x]))
# a b selector quad
# 1 1 3 a 1
# 2 2 4 b 16
Instead of .data, you can also use cur_data() (which accounts for grouping):
test %>%
mutate(quad = map(seq_along(selector), ~ calc(cur_data()[[selector[.x]]])[.x]))
Or, with diag:
test %>%
mutate(quad = diag(as.matrix(calc(cur_data()[selector]))))
# a b selector quad
#1 1 3 a 1
#2 2 4 b 16

You can use rowwise() and get() the selector variable:
library(dplyr)
test %>%
rowwise() %>%
mutate(quad = calc(get(selector))) %>%
ungroup()
# A tibble: 2 × 4
a b selector quad
<dbl> <dbl> <chr> <dbl>
1 1 3 a 1
2 2 4 b 16
Or if the selector repeats, group_by() will be more efficient:
test <- data.frame(a = c(1,2,5), b = c(3,4,6), selector = c("a","b","a"))
test %>%
group_by(selector) %>%
mutate(quad = calc(get(selector[1]))) %>%
ungroup()
# A tibble: 3 × 4
a b selector quad
<dbl> <dbl> <chr> <dbl>
1 1 3 a 1
2 2 4 b 16
3 5 6 a 25

You could also change the function to return a single number and use purrr:
calc <- function(col, id) {test[[col]][[id]]^2}
test %>%
mutate(
quad = purrr::map2_dbl(selector, row_number(), calc)
)
a b selector quad
1 1 3 a 1
2 2 4 b 16

Using base R:
test$quad <- calc(test[,test$selector][cbind(seq_len(nrow(test)), test$selector)])
(Tested with R version 3.5.3, where strings are converted to factors in data.frame.)
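A variant of the base R approach that does not depend on how strings are stored is to index a numeric matrix with a two-column (row, column) index. This is only a sketch: the vector value_cols, listing the columns that selector may point to, is an assumption added for illustration.

```r
test <- data.frame(a = c(1, 2), b = c(3, 4), selector = c("a", "b"))
calc <- function(col) col^2

# Assumed: the candidate columns that `selector` can name
value_cols <- c("a", "b")

# Numeric matrix of the candidate columns only
m <- as.matrix(test[value_cols])

# One (row, column) pair per row of the data frame
idx <- cbind(seq_len(nrow(test)), match(test$selector, value_cols))

# Pick one cell per row, then apply calc() to the picked values
test$quad <- calc(m[idx])
```

match() works the same way whether selector is a character vector or a factor, so the result does not change with the stringsAsFactors default.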

Not quite what you asked for but an alternative might be to restructure the data so that the calculation is easier:
library(tidyr)
test %>%
pivot_longer(
cols = c(a, b)
) %>%
filter(name == selector) %>%
mutate(quad = value^2)
# A tibble: 2 × 4
selector name value quad
<chr> <chr> <dbl> <dbl>
1 a a 1 1
2 b b 4 16
You can join the results back onto the original data using an id column.

Create several new variables using a vector of names and a vector for computation within dplyr::mutate

I'd like to create several new columns. They should take their names from one vector and they should be computed by taking one column in the data and dividing it by another.
mytib <- tibble(id = 1:2, value1 = c(4,6), value2 = c(42, 5), total = c(2,2))
myvalues <- c("value1", "value2")
mynames <- c("value1_percent", "value2_percent")
mytib %>%
mutate({{ mynames }} := {{ myvalues }}/total)
Here I get the following error message, which makes me think that the curly-curly operator is misplaced:
Error in local_error_context(dots = dots, .index = i, mask = mask) : promise already under evaluation: recursive default argument reference or earlier problems?
I'd like to calculate the percentage columns programmatically (since I have many such columns in my data).
The desired output should be equivalent to this:
mytib %>%
mutate( "value1_percent" = value1/total, "value2_percent" = value2/total)
which gives
# A tibble: 2 × 6
id value1 value2 total value1_percent value2_percent
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 4 42 2 2 21
2 2 6 5 2 3 2.5

You could use across and construct the new names in its .names argument:
library(dplyr)
mytib %>%
mutate(across(starts_with('value'),
~ .x / total,
.names = "{.col}_percent"
))
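If the columns should come from the myvalues vector rather than from a name pattern, a sketch along the same lines is to wrap the vector in all_of(). This assumes dplyr 1.0.0 or later, and the object name result is only illustrative:

```r
library(dplyr)

mytib <- tibble(id = 1:2, value1 = c(4, 6), value2 = c(42, 5), total = c(2, 2))
myvalues <- c("value1", "value2")

# all_of() selects exactly the columns named in the character vector;
# .names builds each new column name from the source column name
result <- mytib %>%
  mutate(across(all_of(myvalues),
                ~ .x / total,
                .names = "{.col}_percent"))
```

Here the new names are derived from the old ones, so a separate mynames vector is only needed when the names do not follow a common pattern.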

I prefer mutate(across(...)) in this case. To make your idea work, try reduce2() from purrr.
library(dplyr)
library(purrr)
reduce2(mynames, myvalues,
~ mutate(..1, !!..2 := !!sym(..3)/total), .init = mytib)
# # A tibble: 2 x 6
# id value1 value2 total value1_percent value2_percent
# <int> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 4 42 2 2 21
# 2 2 6 5 2 3 2.5
The above code is actually a shortcut for:
mytib %>%
mutate(!!mynames[1] := !!sym(myvalues[1])/total,
!!mynames[2] := !!sym(myvalues[2])/total)

Create multiple data that count for unique values of each variables using dplyr and loop

I have a question about using dplyr inside a for loop to create multiple data frames. The code without the loop works well, but the version with the loop does not give the expected result and throws this error:
Error in UseMethod("select_") : no applicable method for 'select_' applied to an object of class "character"
Can anyone point me in the right direction?
The code below worked
B <- data %>% select (column1) %>% group_by (column1) %>% arrange (column1) %>% summarise (n = n ())
The code below did not work
column_list <- c ('column1', 'column2', 'column3')
for (b in column_list) {
a <- data %>% select (b) %>% group_by (b) %>% arrange (b) %>% summarise (n = n () )
assign (paste0(b), a)
}

Don't use assign; use lists instead.
We can use the _at variants in dplyr, which work with character vectors of column names.
library(dplyr)
split_fun <- function(df, col) {
df %>% group_by_at(col) %>% summarise(n = n()) %>% arrange_at(col)
}
and then use lapply/map to apply it to different columns
purrr::map(column_list, ~split_fun(data, .))
This returns a list of data frames, which can be accessed individually with [[ if needed.
An example with mtcars:
df <- mtcars
column_list <- c ('cyl', 'gear', 'carb')
purrr::map(column_list, ~split_fun(df, .))
#[[1]]
# A tibble: 3 x 2
# cyl n
# <dbl> <int>
#1 4 11
#2 6 7
#3 8 14
#[[2]]
# A tibble: 3 x 2
# gear n
# <dbl> <int>
#1 3 15
#2 4 12
#3 5 5
#[[3]]
# A tibble: 6 x 2
# carb n
# <dbl> <int>
#1 1 7
#2 2 10
#3 3 3
#4 4 10
#5 6 1
#6 8 1
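In current dplyr the _at verbs are superseded, so the same function could be sketched with across() and all_of() instead. This assumes dplyr 1.0.0 or later; the names count_by and counts are made up for illustration:

```r
library(dplyr)

# Hypothetical replacement for split_fun(): `col` is a column name as a string
count_by <- function(df, col) {
  df %>%
    group_by(across(all_of(col))) %>%
    summarise(n = n(), .groups = "drop")
}

column_list <- c("cyl", "gear", "carb")

# A named list, so each result can be looked up by its column name
counts <- lapply(column_list, function(col) count_by(mtcars, col))
names(counts) <- column_list

counts$cyl
```

Naming the list elements gives the convenience the original assign() call was after, without creating separate objects in the workspace.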

how to use tidyeval functions with loops?

Consider this simple example
library(dplyr)
dataframe <- tibble(id = c(1,2,3,4),
group = c('a','b','c','c'),
value = c(200,400,120,300))
> dataframe
# A tibble: 4 x 3
id group value
<dbl> <chr> <dbl>
1 1 a 200
2 2 b 400
3 3 c 120
4 4 c 300
and this tidyeval function that uses dplyr to aggregate my dataframe according to some input column.
func_tidy <- function(data, mygroup){
quo_var <- enquo(mygroup)
df_agg <- data %>%
group_by(!!quo_var) %>%
summarize(mean = mean(value, na.rm = TRUE),
count = n()) %>%
ungroup()
df_agg
}
now, this works
> func_tidy(dataframe, group)
# A tibble: 3 x 3
group mean count
<chr> <dbl> <int>
1 a 200 1
2 b 400 1
3 c 210 2
but doing the same thing from within a loop FAILS
for(col in c(group)){
func_tidy(dataframe, col)
}
Error in grouped_df_impl(data, unname(vars), drop) : Column `col` is unknown
What is the problem here? How can I use my tidyeval function in a loop?
Thanks!

For looping through column names you will need to use character strings.
for(col in "group")
When you pass this variable to your function, you will need to convert it from a character string to a symbol using rlang::sym. You use !! to unquote so the expression is evaluated.
So your loop would look like (I add a print to see the output):
for(col in "group"){
print( func_tidy(dataframe, !! rlang::sym(col) ) )
}
# A tibble: 3 x 3
group mean count
<chr> <dbl> <int>
1 a 200 1
2 b 400 1
3 c 210 2
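To keep the results instead of only printing them, the loop can store each output in a named list. The following is a sketch that reuses func_tidy from the question; looping over a second column (id) and the list name results are assumptions added for illustration:

```r
library(dplyr)

dataframe <- tibble(id = c(1, 2, 3, 4),
                    group = c("a", "b", "c", "c"),
                    value = c(200, 400, 120, 300))

# Same function as in the question
func_tidy <- function(data, mygroup) {
  quo_var <- enquo(mygroup)
  data %>%
    group_by(!!quo_var) %>%
    summarize(mean = mean(value, na.rm = TRUE),
              count = n()) %>%
    ungroup()
}

# Loop over column names as strings and keep one result per column
results <- list()
for (col in c("group", "id")) {
  results[[col]] <- func_tidy(dataframe, !!rlang::sym(col))
}

results$group
```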

How to refer to a tibble column, using a variable name, in a pipe (R)

I am pretty new to R, so this question may be a bit naive.
I have a tibble with several columns, and I want to create a factor (Bin) by binning the values in one of the columns into N bins. This is done in a pipe. However, I would like to define the column to be binned at the top of the script (e.g. bin2use = RT) to keep this flexible.
I've tried several ways of referring to a column name using this variable, but I cannot get it to work. Amongst others I have tried get(), eval(), [[]]
Simplified example code:
Subject <- c(rep(1,100), rep(2,100))
RT <- runif(200, 300, 800 )
data_st <- tibble(Subject, RT)
bin2use = 'RT'
nbin = 5
binned_data <- data_st %>%
group_by(Subject) %>%
mutate(
Bin = cut_number(get(bin2use), nbin, label = F)
)
Error in mutate_impl(.data, dots) :
non-numeric argument to binary operator

We can use non-standard evaluation with lazyeval:
library(dplyr)
library(ggplot2)
f1 <- function(colName, bin){
call <- lazyeval::interp(~cut_number(a, b, label = FALSE),
a = as.name(colName), b = bin)
data_st %>%
group_by(Subject) %>%
mutate_(.dots = setNames(list(call), "Bin"))
}
f1(bin2use, nbin)
#Source: local data frame [200 x 3]
#Groups: Subject [2]
# Subject RT Bin
# <dbl> <dbl> <int>
#1 1 752.2066 5
#2 1 353.0410 1
#3 1 676.5617 4
#4 1 493.0052 2
#5 1 532.2157 3
#6 1 467.5940 2
#7 1 791.6643 5
#8 1 333.1583 1
#9 1 342.5786 1
#10 1 637.8601 4
# ... with 190 more rows
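mutate_() and lazyeval have since been deprecated in dplyr, so with a recent version the same result could be sketched with the .data pronoun instead. This assumes dplyr 0.8 or later with ggplot2 loaded for cut_number(); the seed and the bin_counts object are added only to make the example reproducible and easy to check:

```r
library(dplyr)
library(ggplot2)  # provides cut_number()

set.seed(1)  # assumed seed, only for reproducibility
data_st <- tibble(Subject = rep(1:2, each = 100),
                  RT = runif(200, 300, 800))

bin2use <- "RT"
nbin <- 5

binned_data <- data_st %>%
  group_by(Subject) %>%
  # .data[[bin2use]] looks up the column whose name is stored in bin2use
  mutate(Bin = cut_number(.data[[bin2use]], nbin, labels = FALSE)) %>%
  ungroup()

# Number of observations per subject and bin
bin_counts <- count(binned_data, Subject, Bin)
```

Since cut_number() makes bins with roughly equal numbers of observations, each subject's 100 values should split evenly across the five bins.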
