Equivalent of next in purrr::map_df - r

I am looking for the equivalent of next in loops for a purrr::map_df call.
map_df plays nicely with dataframes that are NULL (as in the example below), so it works when I set Result <- NULL in my example below.
Could anyone suggest a general solution that does not require setting Result <- NULL, but instead immediately skips to the next element, the way "next" does in a loop?
library(tidyverse)
set.seed(1000)
df <- data.frame(x = rnorm(100), y = rnorm(100), z = rep(LETTERS, 100))
Map_Func <- function(df) {
Sum_Num <- suppressWarnings(sqrt(sum(df$y)))
if( Sum_Num == "NaN" ) {
Result <- NULL
# I would like to have an equivalent to "next" here...
} else {
Result <- df %>% filter(y == max(y)) %>% mutate(Result = x*y)
}
Result
}
Test <- split(df, df$z) %>% map_df(~Map_Func(.))
In the code above, what can I use instead of Result <- NULL in the ugly if statement? I simply want to check a condition and effectively do a "next".

To exit a function you can use the return(<output>) command. This immediately exits the function with the output you define. The following gives the same output you were getting with your sample code.
library(tidyverse)
set.seed(1000)
df <- data.frame(x = rnorm(100), y = rnorm(100), z = rep(LETTERS, 100))
Map_Func <- function(df) {
Sum_Num <- suppressWarnings(sqrt(sum(df$y)))
if (is.nan(Sum_Num)) {
return(NULL)
}
df %>% filter(y == max(y)) %>% mutate(Result = x*y)
}
Test <- split(df, df$z) %>% map_df(~Map_Func(.))

Logic-wise this is not very different from the OP's solution, but it keeps things clean by using separate functions. The custom_check function checks the condition for each group. Using map_if, we apply Map_Func_true only when custom_check returns TRUE; otherwise we apply Map_Func_false, which returns NULL. Finally, we bind the rows.
library(tidyverse)
Map_Func_true <- function(df) {
df %>% filter(y == max(y)) %>% mutate(Result = x*y)
}
Map_Func_false <- function(df) { return(NULL) }
custom_check <- function(df) {
!is.nan(suppressWarnings(sqrt(sum(df$y))))
}
df %>%
group_split(z) %>%
map_if(., custom_check, Map_Func_true, .else = Map_Func_false) %>%
bind_rows()
# A tibble: 26 x 4
# x y z Result
# <dbl> <dbl> <fct> <dbl>
# 1 1.24 2.00 A 2.47
# 2 1.24 2.00 A 2.47
# 3 1.24 2.00 C 2.47
# 4 1.24 2.00 C 2.47
# 5 1.24 2.00 E 2.47
# 6 1.24 2.00 E 2.47
# 7 1.24 2.00 G 2.47
# 8 1.24 2.00 G 2.47
# 9 1.24 2.00 I 2.47
#10 1.24 2.00 I 2.47
# … with 16 more rows
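A variation on the same idea, as a minimal sketch: filter the list of groups first with purrr::keep, so the mapping function never sees the groups that should be skipped. This reuses custom_check and Map_Func_true from the answer above and the data from the question; it is one possible way to get a "next"-like effect, not the only one.

```r
library(dplyr)
library(purrr)

set.seed(1000)
df <- data.frame(x = rnorm(100), y = rnorm(100), z = rep(LETTERS, 100))

# TRUE when sqrt(sum(y)) is a real number, i.e. the group should be processed
custom_check <- function(d) {
  !is.nan(suppressWarnings(sqrt(sum(d$y))))
}

Map_Func_true <- function(d) {
  d %>% filter(y == max(y)) %>% mutate(Result = x * y)
}

Test <- split(df, df$z) %>%
  keep(custom_check) %>%   # drop the failing groups before mapping
  map(Map_Func_true) %>%
  bind_rows()
```

Because the skipped groups are removed up front, Map_Func_true needs no if statement at all.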

Here's another way of looking at it, using purrr::safely:
Map_Func <- function(df) {
Sum_Num <- suppressWarnings(sqrt(sum(df$y)))
df %>% filter(y == max(y)) %>% mutate(Result = x*y)
}
Test <- split(df, df$z) %>%
map(safely(~Map_Func(.))) %>%
transpose() %>%
pluck("result") %>% # use 'error' here to get the error log
bind_rows()
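This way the function becomes cleaner and you also get a log of errors. Note that safely() only captures errors: sqrt() of a negative number produces a warning and NaN rather than an error, so Map_Func as written never fails, and no group is skipped unless the function calls stop() for the NaN case.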
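For safely() to skip a group, the function has to raise an error for it. A minimal sketch, assuming a hypothetical variant Map_Func_strict that calls stop() when the square root is NaN (this variant is not part of the original answer):

```r
library(dplyr)
library(purrr)

set.seed(1000)
df <- data.frame(x = rnorm(100), y = rnorm(100), z = rep(LETTERS, 100))

# Hypothetical variant of Map_Func: raises an error instead of returning NULL
Map_Func_strict <- function(d) {
  Sum_Num <- suppressWarnings(sqrt(sum(d$y)))
  if (is.nan(Sum_Num)) stop("sum(y) is negative for this group")
  d %>% filter(y == max(y)) %>% mutate(Result = x * y)
}

out <- split(df, df$z) %>%
  map(safely(Map_Func_strict)) %>%
  transpose()

Test   <- bind_rows(out$result)  # NULL results from failed groups are dropped
Errors <- compact(out$error)     # one condition object per failed group
```

Errors keeps the group names, so names(Errors) shows which groups were skipped.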

Related

How to replace mutate_ with mutate when using a series to replace swap columns?

I am looking to replace mutate_ with mutate since there are deprecation warnings now, and I am unsure how to apply some of the answers I have found on Stack Overflow. This answer has the deprecated quosure issue, and I am not sure how this one can be applied.
library(tibble)
library(dplyr)
library(magrittr)
library(rlang)
# two data frames/tibbles
df1 <-
data.frame(
w = c(0,9,8),
x = c(1,2,3),
y = c(4,5,6)
) %>% tibble()
df2 <-
data.frame(
x = c(9,9,9),
y = c(1,1,1),
z = c(6,6,6)
) %>% tibble()
# the original function
swapThem <- function(to, from) {
cols <- colnames(from)
if (length(cols) != 0) {
# Loop through `from` columns and if there's a match in `to`, copy and paste
# it into `to`
for (i in seq_along(cols)) {
col <- cols[i]
if (col %in% colnames(to)) {
print(col)
dots <-
stats::setNames(list(lazyeval::interp(
~ magrittr::use_series(from, x), x = as.name(col)
)), col)
to <- to %>%
#dplyr::mutate(.dots = dots)
dplyr::mutate_(.dots = dots)
} else {
next
}
}
}
return(to)
}
Here is a simpler base R alternative -
swapThem <- function(to, from) {
cols <- intersect(colnames(to), colnames(from))
if(length(cols)) to[cols] <- from[cols]
to
}
swapThem(df1, df2)
# A tibble: 3 × 3
# w x y
# <dbl> <dbl> <dbl>
#1 0 9 1
#2 9 9 1
#3 8 9 1
The output is similar to what I get when I run your code with swapThem(df1, df2):
#[1] "x"
#[1] "y"
# A tibble: 3 × 3
# w x y
# <dbl> <dbl> <dbl>
#1 0 9 1
#2 9 9 1
#3 8 9 1
There are easier ways to do this (see Ronak Shah's base R approach, for example), however since you specifically asked about how to switch from mutate_ to mutate, you can adjust your original code in this way:
swapThem <- function(to, from) {
cols <- colnames(from)
if (length(cols) != 0) {
# Loop through `from` columns and if there's a match in `to`, copy and paste
# it into `to`
for (i in seq_along(cols)) {
col <- cols[i]
if (col %in% colnames(to)) {
to <- to %>% dplyr::mutate(!!sym(col) := from[[col]])
} else {
next
}
}
}
return(to)
}
Note that you can also use {{}}, like this:
to <- to %>% dplyr::mutate({{col}} := from[[col]])
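As a small self-contained check of the name-injection idea, the sketch below compares !!sym(col) := with the glue-style "{col}" := form, which I believe recent rlang/dplyr versions also accept. The tibbles mirror df1 and df2 from the question, and col is a string, as in the loop above.

```r
library(dplyr)

to   <- tibble(w = c(0, 9, 8), x = c(1, 2, 3), y = c(4, 5, 6))
from <- tibble(x = c(9, 9, 9), y = c(1, 1, 1), z = c(6, 6, 6))

col <- "x"  # column name held as a string

# Unquote a symbol built from the string
a <- to %>% mutate(!!rlang::sym(col) := from[[col]])

# Glue-style name on the left-hand side of :=
b <- to %>% mutate("{col}" := from[[col]])
```

Both forms overwrite the existing column x in to, leaving w and y untouched.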
Here is another tidy approach that uses bind_cols. The relocate call ensures that the column order of to is preserved:
swapThem <- function(to,from) {
bind_cols(
to %>% select(all_of(setdiff(colnames(to), colnames(from)))),
from %>% select(all_of(intersect(colnames(to), colnames(from))))
) %>%
relocate(colnames(to))
}

What is faster/better: Loop over each row of a dataframe or split it into a list of length `nrow` , R

I am just wondering if this is a serious tradeoff one should consider.
Let's say you have a dataframe in R and want to perform an operation on each observation (row).
I know iterating over rows is already a delicate issue, so I was wondering which of these three options is preferable:
1) A normal for loop over each row
2) Split the dataframe into a list of nrow elements, apply the operation to each element, and bind the results together
3) Do the same as above in parallel
Without any benchmarking or so, this is basically what I am asking in pseudocode:
library(future.apply)
library(dplyr) # for %>% and bind_rows
n = 1000000
x = 1:n
y = x + rnorm(n, mean=50, sd=50)
df = data.frame(
x = x,
y = y
)
# 1)
# iterating over each row with normal for loop
for(r in seq_len(nrow(df))){
row = df[r, ]
row = f(row) # f is a placeholder for the per-row operation
df[r, ] = row
}
# 2)
# create a list of length nrow(df), apply an operation to each list element
# and rowbind it together
res = df %>% split(., .$x) %>% lapply(., function(x){
r = f(x)
})
bind_rows(res, .id="x")
# 3)
# create a list of length nrow(df), apply an operation to each list element in parallel
# and rowbind it together
res = df %>% split(., .$x) %>% future_lapply(., function(x){
r = f(x)
})
bind_rows(res, .id="x")
Probably none of the above options is the best, so I'd be happy to hear any thoughts on this. Sorry if this is a very naive question. I am just getting started with R.
I very often use the scheme tibble %>% nest %>% mutate(map) %>% unnest.
Take a look at the example below.
library(tidyverse)
n = 10000
f = function(data) sqrt(data$x^2+data$y^2+data$z^2)
tibble(
x = 1:n,
y = x + rnorm(n, mean=50, sd=50),
z = x + y + rnorm(n, mean=50, sd=50)
) %>% nest(data = c(x:z)) %>%
mutate(l = map(data, f)) %>%
unnest(c(data, l))
output
# A tibble: 10,000 x 4
x y z l
<int> <dbl> <dbl> <dbl>
1 1 67.1 136. 151.
2 2 75.4 127. 148.
3 3 -11.1 38.9 40.6
4 4 58.1 106. 121.
5 5 23.5 126. 128.
6 6 73.4 179. 193.
7 7 44.5 121. 129.
8 8 106. 131. 169.
9 9 32.5 140. 144.
10 10 -27.7 82.7 87.8
# ... with 9,990 more rows
For me personally, it is very clear and elegant. But you can disagree with that.
Update 1
Honestly, your question also intrigued me in terms of performance. So I decided to check it out.
Here is the code:
library(tidyverse)
library(microbenchmark)
n = 1000
df = tibble(
x = 1:n,
y = x + rnorm(n, mean=50, sd=50),
z = x + y + rnorm(n, mean=50, sd=50)
)
f = function(data) sqrt(data$x^2+data$y^2+data$z^2)
f1 = function(df){
df %>% nest(data = c(x:z)) %>%
mutate(l = map(data, f)) %>%
unnest(c(data, l))
}
f1(df)
f2 = function(df){
df = df %>% mutate(l=NA)
for(r in 1:nrow(df)){
row = df[r, ]
df$l[r] = f(row)
}
df
}
f2(df)
f3 = function(df){
res = df %>%
split(., .$x) %>%
lapply(., f)
df %>% bind_cols(l = unlist(res))
}
f3(df)
ggplot2::autoplot(microbenchmark(f1(df), f2(df), f3(df), times=100))
Here is the result (the benchmark plot is not reproduced here):
Do I have to add anything else to explain why the scheme tibble %>% nest %>% mutate(map) %>% unnest is so cool?
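One caveat worth checking before reading too much into the timings: when every column is nested and there is no grouping, nest() appears to return a single row, so map() calls f once on the whole tibble instead of once per row. The sketch below (same f and column layout as above, with an arbitrary seed) lets you verify this and compares the result with a plain vectorised mutate, which works here because f only uses vectorised arithmetic.

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(1)
n <- 1000
df <- tibble(
  x = 1:n,
  y = x + rnorm(n, mean = 50, sd = 50),
  z = x + y + rnorm(n, mean = 50, sd = 50)
)
f <- function(data) sqrt(data$x^2 + data$y^2 + data$z^2)

nested <- df %>% nest(data = c(x:z))
nrow(nested)  # number of list elements that map() will iterate over

via_nest   <- nested %>% mutate(l = map(data, f)) %>% unnest(c(data, l))
via_mutate <- df %>% mutate(l = sqrt(x^2 + y^2 + z^2))
```

If the operation is not vectorised and really must run per row, the nested version needs a row identifier to nest by, and its timings would then be more comparable to the loop and split versions.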

Mutate by condition with many columns with each one a different setting

I have been looking for, but have not found, a way to apply a simple if to many columns in dplyr.
I have this code (it works):
library(dplyr)
library(magrittr)
data("PlantGrowth")
PlantGrowth %>% mutate (
a=if_else(group=="ctrl", weight*2, weight*100),
b=if_else(group=="ctrl", weight*1.5, weight/100),
c=if_else(group=="ctrl", weight*4, weight*100),
d=if_else(group=="ctrl", weight*5, weight/1000)
)
And I would like to not repeat the condition. Something like that:
PlantGrowth %>% mutate_if_foo (
group=="ctrl",{
a=weight*2,
b=weight*1.5,
c=weight*4,
d=weight*5
}
)%>% mutate_if_foo (
group!="ctrl",{
a=weight*100,
b=weight/100,
c=weight*100,
d=weight/1000
}
)
I've found many answers about mutate_if, mutate_all, mutate_at and case_when, but they don't answer my question.
A dplyr / tidyverse solution, please.
Thanks in advance
EDIT
Following @Rohit_das's idea about functions, I tried:
mtcars %>% ( function(df) {
if (df$am==1){
df%>% mutate(
a=df$mpg*3,
b=df$cyl*10)
}else{
df%>% mutate(
a=df$disp*300,
d=df$cyl*1000)
}
})
but I get this warning message:
In if (df$am == 1) { :
the condition has length > 1
and only the first element will be used
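Not sure I understand the issue here. If you just want to reduce the verbosity of the code, then just create a custom function. The columns have to be passed in as arguments, because group and weight are not visible inside a function defined outside mutate:
customif = function(group, weight, x, y) {
if_else(group=="ctrl", weight*x, weight*y)
}
Then you can call this function in your mutate as
PlantGrowth %>% mutate (
a=customif(group, weight, 2, 100),
b=customif(group, weight, 1.5, 1/100),
c=customif(group, weight, 4, 100),
d=customif(group, weight, 5, 1/1000)
)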
I think I found a neat solution with purrr. It takes a data frame of inputs and then dynamically names new columns a:d with new inputs for each column. First column will use x = 2, y = 100 and z = "a" and then the next row, and so on. The cool thing with functional programming like this is that it is very easy to scale up.
library(tidyverse)
iterate <- tibble(x = c(2, 1.5, 4, 5),
y = c(100, 1/100, 100, 1/1000),
z = c("a", "b", "c", "d"))
fun <- function(x, y, z) {
PlantGrowth %>%
mutate(!!z := if_else(group == "ctrl", weight * x, weight * y)) %>%
select(3)
}
PlantGrowth %>%
bind_cols(
pmap_dfc(iterate, fun)
) %>%
as_tibble
Which gives you the same df:
# A tibble: 30 x 6
weight group a b c d
<dbl> <fct> <dbl> <dbl> <dbl> <dbl>
1 4.17 ctrl 8.34 6.26 16.7 20.8
2 5.58 ctrl 11.2 8.37 22.3 27.9
3 5.18 ctrl 10.4 7.77 20.7 25.9
4 6.11 ctrl 12.2 9.17 24.4 30.6
5 4.5 ctrl 9 6.75 18 22.5
I think I've found an answer. I tested on mtcars. I didn't test yet on my real code.
Please comment if you think the concept is wrong.
The filter conditions have to be mutually exclusive; otherwise I will get duplicate rows.
library(dplyr)
library(magrittr)
library(tibble) # only if necessary to preserve rownames
mtcars %>% ( function(df) {
rbind(
(df
%>% tibble::rownames_to_column(.) %>%tibble::rowid_to_column(.) # to preserve rownames
%>%dplyr::filter(am==1)
%>%dplyr::mutate(
a=mpg*3,
b=cyl*10,d=NA)),
(df
%>% tibble::rownames_to_column(.) %>%tibble::rowid_to_column(.) # to preserve rownames
%>%dplyr::filter(am!=1)
%>%dplyr::mutate(
a=disp*3,
d=cyl*100,b=NA))
)
}) %>%arrange(rowid)
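The warning in the EDIT above arises because if expects a single TRUE or FALSE, while df$am == 1 is a whole vector. A minimal sketch that stays vectorised and still writes the condition only once: store it in a temporary column (here hypothetically named ctrl) and reuse it in each if_else. The multipliers are the ones from the PlantGrowth example in the question.

```r
library(dplyr)

result <- PlantGrowth %>%
  mutate(
    ctrl = group == "ctrl",                  # evaluate the condition once
    a = weight * if_else(ctrl, 2, 100),
    b = weight * if_else(ctrl, 1.5, 1/100),
    c = weight * if_else(ctrl, 4, 100),
    d = weight * if_else(ctrl, 5, 1/1000)
  ) %>%
  select(-ctrl)                              # drop the helper column
```

Unlike the filter-and-rbind approach, this keeps the rows in their original order, so no rowid bookkeeping is needed.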

Passing multiple arguments to function in dplyr::summarise_if

I am trying to make a function that uses summarise_if (or summarise_at) to calculate the correlation between one column and many others in the data set.
data_set <- data.frame(grp = rep(c("a","b","c"), each =
3), x = rnorm(9), y = rnorm(9), z = rnorm(9))
multiple_cor <- function(d, vars){
d %>%
dplyr::group_by(grp) %>%
dplyr::summarise_at(vars, cor, x) %>%
return()
}
multiple_cor(data_set, vars = c("y","z") )
This gives the error:
Error in dots_list(...) : object 'x' not found
Called from: dots_list(...)
I am fairly sure this is from the cor function not evaluating x within the right environment, but I am not sure how to get around this issue.
summarise_at has a .funs argument, so it can handle anonymous functions. I created a function called cors inside your function and passed it to summarise_at via funs() to handle the x.
multiple_cor <- function(d, vars){
cors <- function(x, a = NULL) {
stats::cor(x, a)
}
d %>%
dplyr::group_by(grp) %>%
dplyr::summarise_at(vars, funs(cors(x, .))) %>%
return()
}
multiple_cor(data_set, vars = c("y","z") )
# A tibble: 3 x 3
grp y z
<fct> <dbl> <dbl>
1 a 0.803 0.894
2 b -0.284 -0.949
3 c 0.805 -0.571
The outcome of the function is identical to that of the following lines of code:
data_set %>%
group_by(grp) %>%
summarise(cxy = cor(x, y),
cxz = cor(x, z))
# A tibble: 3 x 3
grp cxy cxz
<fct> <dbl> <dbl>
1 a 0.803 0.894
2 b -0.284 -0.949
3 c 0.805 -0.571
Read this dplyr documentation.
And this google groups discussion.
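funs() has been deprecated since dplyr 0.8.0, so on a current dplyr the same idea is usually written with across() and a lambda. The following is a sketch under that assumption, not the original answer's code; the argument is renamed to cols here to avoid clashing with dplyr::vars, and the seed is arbitrary.

```r
library(dplyr)

set.seed(42)
data_set <- data.frame(grp = rep(c("a", "b", "c"), each = 3),
                       x = rnorm(9), y = rnorm(9), z = rnorm(9))

multiple_cor <- function(d, cols) {
  d %>%
    group_by(grp) %>%
    # inside the lambda, x is the current group's x column and .x is each
    # selected column in turn
    summarise(across(all_of(cols), ~ cor(x, .x)), .groups = "drop")
}

res <- multiple_cor(data_set, cols = c("y", "z"))
```

The result has one row per group and one correlation column per name in cols.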

Dplyr/tidyverse: rename_at `.funs` must contain one renaming function, not 4

I have the following test data:
library(tidyverse)
df <- tibble(
g1 = c(1, 1, 2, 2, 2),
g2 = c("a", "a", "a", "b", "b"),
a = sample(5),
b = sample(5)
)
I would like to write a function that summarises grouped columns with a mean and I wish I could have the resulting columns prefixed with "mean_"
my_summarise1 <- function(df, group_var, summarise_var) {
df %>%
group_by_at(.vars = group_var) %>%
summarise_at(.vars = summarise_var, .funs= mean) %>%
rename_at(.vars= summarise_var, .funs=paste('mean_', .))
}
Without the rename_at line it works fine, but with it, it throws an error:
my_summarise1(df, vars(g1,g2),vars(a,b))
R responds with
Error: `.funs` must contain one renaming function, not 4
How should I effectively prefix the new column names?
Smaller question: is it possible to avoid vars() or quotes around the column names passed as parameters when calling a function?
Knowing these two small things would greatly enhance my code, thank you all very much in advance for help.
While the earlier answer by @docendodiscimus is more succinct, for what it's worth, there are two issues with your code:
You need to wrap the paste (better: paste0) function within funs.
You need to ungroup prior to renaming (see e.g. this post).
A working version of your code looks like this:
my_summarise1 <- function(df, group_var, summarise_var) {
df %>%
group_by_at(group_var) %>%
summarise_at(summarise_var, mean) %>%
ungroup() %>%
rename_at(summarise_var, funs(paste0('mean_', .)))
}
my_summarise1(df, vars(g1, g2), vars(a, b))
## A tibble: 3 x 4
# g1 g2 mean_a mean_b
# <dbl> <chr> <dbl> <dbl>
#1 1. a 2.50 2.50
#2 2. a 4.00 5.00
#3 2. b 3.00 2.50
If you want to take a simple route, you can use dplyr's way of adding suffixes to the summarised columns:
my_summarise1 <- function(df, group_var, summarise_var) {
df %>%
group_by_at(.vars = group_var) %>%
summarise_at(.vars = summarise_var, funs(mean=mean))
}
my_summarise1(df, vars(g1,g2), vars(a,b))
# A tibble: 3 x 4
# Groups: g1 [?]
g1 g2 a_mean b_mean
<dbl> <chr> <dbl> <dbl>
1 1. a 3.50 4.50
2 2. a 4.00 1.00
3 2. b 2.00 2.50
In this case, funs(mean=mean) tells dplyr to use the suffix mean and apply the function mean. For clarity, you could use funs(mysuffix = mean) to use any different suffix and apply the mean function.
Re OP's question in comment: you can use the following modification which doesn't require the use of vars when calling the function.
my_summarise2 <- function(df, group_var, summarise_var) {
df %>%
group_by_at(.vars = group_var) %>%
summarise_at(.vars = summarise_var, funs(mean=mean))
}
my_summarise2(df, c("g1","g2"), c("a","b"))
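On dplyr 1.0 or later, where funs() and the _at verbs are superseded, a prefix can be set directly through the .names argument of across(). This is a sketch under that assumption; my_summarise3 is a hypothetical name, and the a and b columns are fixed here (the question uses sample(5)) so the result is reproducible.

```r
library(dplyr)

df <- tibble(
  g1 = c(1, 1, 2, 2, 2),
  g2 = c("a", "a", "a", "b", "b"),
  a = c(3, 1, 4, 5, 2),
  b = c(2, 5, 1, 3, 4)
)

my_summarise3 <- function(df, group_var, summarise_var) {
  df %>%
    group_by(across(all_of(group_var))) %>%
    summarise(
      across(all_of(summarise_var), mean, .names = "mean_{.col}"),
      .groups = "drop"  # return an ungrouped tibble
    )
}

res <- my_summarise3(df, c("g1", "g2"), c("a", "b"))
```

Changing the glue string to "{.col}_mean" would give the suffix form shown in the earlier answer.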
