I would like to define a wrapper around an inner function.
The idea is to repeat random sampling that uses one of the base r* functions (e.g. runif, rnorm, etc.) and to let the user easily swap this inner function or define custom ones.
The example below is a reproducible case that I cannot make work with tidyeval patterns and, more precisely, within a purrr::map. The evaluation of ... does not seem to happen properly. I am missing something about quosure evaluation but I cannot figure out what. I also show below a workaround that works fine with good old replicate.
I would like to implement this behaviour in more complex cases and, more generally, I would be grateful for any pointers that help me understand why the following does not work.
# use the tidyverse and define a dummy tibble
library(tidyverse)
df <- tibble(col1=seq(10, 50, 10), col2=col1+5)
# first random function, on top of stats::runif
random_1 <- function(x, min, max){
x %>%
rowwise() %>%
mutate(new=runif(1, min={{min}}, max={{max}})) %>%
ungroup()
}
# second random function, on top of stats::rnorm
random_2 <- function(x, mean, sd){
x %>%
rowwise() %>%
mutate(new=rnorm(1, mean={{mean}}, sd={{sd}})) %>%
ungroup()
}
# at top level, everything works fine
> df %>% random_1(min=col1, max=col2)
> df %>% random_2(mean=col1, sd=col2)
# So far so good
# now we wrap it for a single shot
random_fun <- function(x, random_fun, ...){
random_fun(x, ...)
}
random_fun(df, random_1, min=col1, max=col2)
# Still fine.
# Here comes the trouble:
random_fun_k <- function(df, k, random_fun, ...){
map(1:k, ~random_fun(df, ...))
}
random_fun_k(df, k=2, random_1, min=col1, max=col2)
Error in is_quosure(x) : argument "x" is missing, with no default
The following workaround built on replicate works fine, yet I would like to stick to the tidyeval spirit:
random_fun_k_oldie <- function(df, k, random_fun, ...){
f <- random_fun(df, ...)
replicate(k, f, simplify=FALSE)
}
random_fun_k_oldie(df, k=2, random_1, min=col1, max=col2)
random_fun_k_oldie(df, k=2, random_2, mean=col1, sd=col2)
It may be better to use an ordinary anonymous function, i.e. function(x), instead of the formula lambda:
library(purrr)
random_fun_k <- function(df, k, random_fun, ...){
map(seq_len(k), function(x) random_fun(df, ...))
}
Testing:
> random_fun_k(df, k=2, random_1, min=col1, max=col2)
[[1]]
# A tibble: 5 × 3
col1 col2 new
<dbl> <dbl> <dbl>
1 10 15 12.6
2 20 25 21.4
3 30 35 34.1
4 40 45 40.7
5 50 55 53.8
[[2]]
# A tibble: 5 × 3
col1 col2 new
<dbl> <dbl> <dbl>
1 10 15 13.1
2 20 25 24.2
3 30 35 33.8
4 40 45 41.6
5 50 55 50.9
NOTE: The wrapper function and its argument share the same name, random_fun, which could cause some confusion as well (though it is not the source of the error). It may be better to give the function argument a different name:
random_fun <- function(x, rn_fun, ...){
rn_fun(x, ...)
}
purrr's formula lambdas support positional arguments with the ..1, ..2, etc. syntax. This is implemented via the ... mechanism, so inside the lambda ... refers to the lambda's own arguments rather than to the dots of the enclosing function. Because of this, you're not passing the correct arguments to random_fun.
The solution is to use a normal anonymous function, as akrun suggested. You could also use the \(x) x shorthand available since R 4.1.
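To see the difference in isolation, here is a minimal sketch that leaves tidyeval out entirely. The names show_args, with_formula and with_function are hypothetical and exist only for this illustration; the sketch assumes purrr is installed.

```r
library(purrr)

# Hypothetical helper: pastes together whatever arrives through `...`
show_args <- function(...) paste(..., sep = "-")

# Formula lambda: inside the body, `...` is the lambda's own dots
# (the element being mapped over), not the dots of with_formula().
with_formula <- function(...) map_chr(1:2, ~ show_args(...))

# Ordinary anonymous function: it has no `...` of its own, so `...`
# is found lexically in with_function() and forwarded as intended.
with_function <- function(...) map_chr(1:2, function(i) show_args(...))

formula_result  <- with_formula("a", "b")
function_result <- with_function("a", "b")
function_result
# [1] "a-b" "a-b"
```

Only the second version forwards "a" and "b". In the question, the same mix-up means random_1 never receives min and max.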
Related
I'm experimenting with using functions in dataframes (tidyverse tibbles) in R and I ran into some difficulties. The following is a minimal (trivial) example of my problem.
Suppose I have a function that takes in three arguments: x and y are numbers, and f is a function. It performs f(x) + y and returns the output:
func_then_add = function(x, y, f) {
result = f(x) + y
return(result)
}
And I have some simple functions it might use as f:
squarer = function(x) {
result = x^2
return(result)
}
cuber = function(x) {
result = x^3
return(result)
}
Done on its own, func_then_add works as advertised:
> func_then_add(5, 2, squarer)
[1] 27
> func_then_add(6, 11, cuber)
[1] 227
But let's say I have a dataframe (tidyverse tibble) with two columns for the numeric arguments, and one column for which function I want:
library(tidyverse)
library(magrittr)
test_frame = tribble(
~arg_1, ~arg_2, ~func,
5, 2, squarer,
6, 11, cuber
)
> test_frame
# A tibble: 2 x 3
arg_1 arg_2 func
<dbl> <dbl> <list>
1 5 2 <fn>
2 6 11 <fn>
I then want to make another column result that is equal to func_then_add applied to those three columns. It should be 27 and 227 like before. But when I try this, I get an error:
> test_frame %>% mutate(result=func_then_add(.$arg_1, .$arg_2, .$func))
Error in f(x) : could not find function "f"
Why does this happen, and how do I get what I want properly? I confess that I'm new to "functional programming", so maybe I'm just making an obvious syntax error ...
Not the most elegant but we can do:
test_frame %>%
mutate(Res= map(seq_along(.$func), function(x)
func_then_add(.$arg_1, .$arg_2, .$func[[x]])))
EDIT: The above passes the entire arg_1 and arg_2 columns on every iteration, which isn't really what OP desires. As suggested by @January, this is better written as:
Result <- test_frame %>%
mutate(Res= map(seq_along(.$func), function(x)
func_then_add(.$arg_1[x], .$arg_2[x], .$func[[x]])))
Result$Res
The above is still not ideal since it returns a list. A better alternative (again as suggested by @January) is to use map_dbl, which returns a double vector instead of a list:
test_frame %>%
mutate(Res= map_dbl(seq_along(.$func), function(x)
func_then_add(.$arg_1[x], .$arg_2[x], .$func[[x]])))
# A tibble: 2 x 4
arg_1 arg_2 func Res
<dbl> <dbl> <list> <dbl>
1 5 2 <fn> 27
2 6 11 <fn> 227
This is because you should map instead of mutating. Mutate calls the function once, and supplies the whole columns as arguments.
The second problem is that test_frame$func[1] is not a function, but a list with one element. You can't have "function" columns, only list columns.
Try this:
test_frame$result <- with(test_frame,
map_dbl(1:2, ~ func_then_add(arg_1[.], arg_2[.], func[[.]])))
Result:
# A tibble: 2 x 4
arg_1 arg_2 func result
<dbl> <dbl> <list> <dbl>
1 5 2 <fn> 27
2 6 11 <fn> 227
EDIT: a simpler solution using dplyr, mutate and rowwise:
test_frame %>% rowwise %>% mutate(res=func_then_add(arg_1, arg_2, func))
Quite frankly, I am slightly puzzled by this last one. Why func and not func[[1]]? func should be a list, and not function. mutate and rowwise are doing here something sinister, like automatically converting a list to a vector.
Edit 2: actually, this is written explicitly in the rowwise manual:
Its main impact is to allow you to work with list-variables in
‘summarise()’ and ‘mutate()’ without having to use ‘[[1]]’.
Final edit: I became so fixated on tidyverse recently that I did not think of the simplest option – using base R:
apply(test_frame, 1, function(x) func_then_add(x$arg_1, x$arg_2, x$func))
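The row-by-row idea can also be sketched with purrr::pmap_dbl, which walks over all columns in parallel and matches them to the function's arguments by name. This is an alternative to the versions above, not something the original answers used; the definitions are repeated in compact form so the block runs on its own.

```r
library(purrr)
library(tibble)

# Compact copies of the functions from the question
squarer <- function(x) x^2
cuber   <- function(x) x^3
func_then_add <- function(x, y, f) f(x) + y

test_frame <- tribble(
  ~arg_1, ~arg_2, ~func,
  5,      2,      squarer,
  6,      11,     cuber
)

# pmap_dbl() calls the function once per row; each list-column element
# is passed as the function itself, so no [[1]] is needed.
res <- pmap_dbl(test_frame, function(arg_1, arg_2, func) {
  func_then_add(arg_1, arg_2, func)
})
res
# [1]  27 227
```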
Is there any exception handling mechanism in dplyr's mutate()? What I mean is a way to catch exceptions and handle them.
Let us suppose that I have a function that throws an error in some cases (in the example, if the input is not positive). For the sake of simplicity I define the function myself, but in real life it will come from some R package. Let us suppose this function is vectorized:
# function throwing an error
my_func <- function(x){
if(x > 0) return(sqrt(x))
stop('x must be positive')
}
my_func_vect <- Vectorize(my_func)
Now, let's suppose I want to use this function inside mutate().
If this function is used inside a mutate(), it stops at the first error and no result is returned:
library(dplyr)
# dummy data
data <- data.frame(x = c(1, -1, 4, 9))
data %>% mutate(y = my_func_vect(x))
# Error in mutate_impl(.data, dots) : Evaluation error: x must be positive.
Is there a way to catch the error, and do something (e.g. return an NA) in this case, while getting results for the other elements?
The result I expect is what would be achieved using a loop with tryCatch(), i.e. something along the lines of:
y <- rep(NA_real_, length(data$x))
for(i in seq_along(data$x)) {
tryCatch({
y[i] <- my_func_vect(data$x[i])
}, error = function(err){})
}
y
# Result is: 1 NA 2 4
We can also make use of purrr's safely() or possibly() functions.
From the purrr help:
safely: wrapped function instead returns a list with components result and error. One value is always NULL.
quietly: wrapped function instead returns a list with components result, output, messages and warnings.
possibly: wrapped function uses a default value (otherwise) whenever an error occurs.
It doesn't change the fact that you have to apply the function to each row separately.
library(dplyr)
library(purrr)
# function throwing an error
my_func <- function(x){
if(x > 0) return(sqrt(x))
stop('x must be positive')
}
my_func_vect <- Vectorize(my_func)
# dummy data
data <- data.frame(x = c(1, -1, 4, 9))
With map:
data %>%
mutate(y = map_dbl(x, ~possibly(my_func_vect, otherwise = NA_real_)(.x)))
#> x y
#> 1 1 1
#> 2 -1 NA
#> 3 4 2
#> 4 9 3
Using rowwise():
data %>%
rowwise() %>%
mutate(y = possibly(my_func_vect, otherwise = NA_real_)(x))
#> Source: local data frame [4 x 2]
#> Groups: <by row>
#>
#> # A tibble: 4 x 2
#> x y
#> <dbl> <dbl>
#> 1 1 1
#> 2 -1 NA
#> 3 4 2
#> 4 9 3
The other functions are somewhat more difficult to use in a data-frame context, as they are better suited to working with lists and return lists.
Created on 2018-05-15 by the reprex package (v0.2.0).
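For completeness, here is a minimal sketch of how safely() could be used for the same data. It keeps the error objects around, at the cost of an extra step to pull the values out of the returned list. The variable names safe_func, out, results and failed are illustrative only.

```r
library(purrr)

# Same function as above
my_func <- function(x) {
  if (x > 0) return(sqrt(x))
  stop("x must be positive")
}

# safely() returns list(result = ..., error = ...) for every call;
# `otherwise` is used as the result whenever an error occurs.
safe_func <- safely(my_func, otherwise = NA_real_)

out <- map(c(1, -1, 4, 9), safe_func)

# Extract the numeric results and flag which inputs failed
results <- map_dbl(out, "result")
failed  <- map_lgl(out, function(o) !is.null(o$error))

results
# [1]  1 NA  2  3
failed
# [1] FALSE  TRUE FALSE FALSE
```

If only the values matter, possibly() as shown above is shorter; safely() is mainly useful when the error messages themselves need to be inspected.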
If you want to handle every error individually, maybe you shouldn't use the vectorized function. Instead, use map from the purrr package, which is effectively the same as lapply here.
Define a helper that catches the error and returns NA instead:
try_my_func <- function(x) {
tryCatch(my_func(x), error = function(err){NA})
}
Then use mutate with map
data %>% mutate(y = purrr::map(x, try_my_func))
x y
1 1 1
2 -1 NA
3 4 2
4 9 3
Or similarly, if you don't want to declare a new function.
data %>% mutate(y = purrr::map(x, ~ tryCatch(my_func(.), error = function(err){NA})))
And lastly, if you do want to use a vectorized function, you can skip the map function altogether. But personally I never use Vectorize, so I'd do it with map.
data %>% mutate(y = Vectorize(try_my_func)(x))
I don't know much about R, and I have variables in a data frame that I am trying to calculate some stats for, with the hope of writing them into a csv. I have been using a basic for loop, like this:
for(i in x) {
mean(my_dataframe[, i], na.rm = TRUE)
}
where x is colnames(my_dataframe)
Not every variable is numeric - but when I add a print to the loop, this works fine - it just prints means when applicable, and NA when not. However, when I try to assign this loop to a value (means <- for....), it produces an empty list. Similarly, when I try to directly write the results to a csv, I get an empty csv. Does anyone know why this is happening/how to fix it?
This should work for you. You don't need a loop; just use the summary() function.
summary(cars)
The for loop executes the code inside, but it doesn't put any results together. To do that, you need to create an object to hold the results and explicitly assign each one:
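# pre-allocate one slot per column
my_means = rep(NA, ncol(my_dataframe))
for(i in seq_along(x)) {
# x[i] is the name of the i-th column; [[ ]] extracts it as a vector
my_means[i] = mean(my_dataframe[[x[i]]], na.rm = TRUE)
}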
Note that I have also changed your loop to use i = 1, 2, 3, ... instead of each name.
sapply, as shown in another answer, is a nice shortcut that does the loop and combines the results for you, so you don't need to worry about pre-allocating the result object. It's also smart enough to iterate over columns of a data frame by default.
my_means_2 = sapply(my_dataframe, mean, na.rm = T)
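Since the question also asks about writing the results to a csv, here is a minimal base R sketch of the whole round trip. It uses the built-in iris data as a stand-in for my_dataframe and writes to a temporary file; both are assumptions made so the block runs on its own.

```r
# Stand-in for my_dataframe: iris has four numeric columns and one factor
df <- iris

# Keep only the numeric columns so mean() never sees a factor
num_cols <- vapply(df, is.numeric, logical(1))

# vapply() is like sapply() but checks that each result is one number
means <- vapply(df[num_cols], mean, numeric(1), na.rm = TRUE)

# Put the named vector into a two-column data frame and write it out
out <- data.frame(variable = names(means), mean = unname(means))
path <- tempfile(fileext = ".csv")
write.csv(out, path, row.names = FALSE)
```

The key point is the same as in the loop above: the results have to be collected into an object (here means) before anything can be written to disk.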
Please give a reproducible example the next time you post a question.
Input is how I imagine your data would look like.
Input:
library(nycflights13)
library(tidyverse)
input <- flights %>% select(origin, air_time, carrier, arr_delay)
input
# A tibble: 336,776 x 4
origin air_time carrier arr_delay
<chr> <dbl> <chr> <dbl>
1 EWR 227. UA 11.
2 LGA 227. UA 20.
3 JFK 160. AA 33.
4 JFK 183. B6 -18.
5 LGA 116. DL -25.
6 EWR 150. UA 12.
7 EWR 158. B6 19.
8 LGA 53. EV -14.
9 JFK 140. B6 -8.
10 LGA 138. AA 8.
# ... with 336,766 more rows
The way I see it, there are 2 ways to do it:
Use summarise_all()
summarise_all() will summarise all your columns, including those that are not numeric.
Method:
input %>% summarise_all(funs(mean(., na.rm = TRUE)))
# A tibble: 1 x 4
origin air_time carrier arr_delay
<dbl> <dbl> <dbl> <dbl>
1 NA 151. NA 6.90
Warning messages:
1: In mean.default(origin, na.rm = TRUE) :
argument is not numeric or logical: returning NA
2: In mean.default(carrier, na.rm = TRUE) :
argument is not numeric or logical: returning NA
You will get a result and a warning if you were to use this method.
Use summarise_if
summarise_if() summarises only the numeric columns, which avoids the warnings shown above.
Method:
input %>% summarise_if(is.numeric, funs(mean(., na.rm = TRUE)))
# A tibble: 1 x 2
air_time arr_delay
<dbl> <dbl>
1 151. 6.90
You can then add an NA column for the others.
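In dplyr 1.0.0 and later, funs() is deprecated and summarise_if() is superseded by across(). A minimal sketch of the equivalent call follows; it uses a tiny made-up tibble instead of the flights data so that it runs without nycflights13, and the values in it are illustrative only.

```r
library(dplyr)

# Small illustrative stand-in for the flights subset above (made-up rows)
input <- tibble(
  origin    = c("EWR", "LGA", "JFK"),
  air_time  = c(227, NA, 160),
  arr_delay = c(11, 20, 33)
)

# across(where(is.numeric), ...) applies the function to numeric columns only
res <- input %>%
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
res
```

The character column origin is dropped from the summary, so no warning is raised.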
You can use lapply or sapply for this sort of thing. e.g.
sapply(my_dataframe, mean)
will get you all the means. You can also give it your own function e.g.
sapply(my_dataframe, function(x) sum(x^2 + 2)/4 - 9)
If not all variables are numeric, you can use summarise_if from dplyr to get the results just for the numeric columns.
require(dplyr)
my_dataframe %>%
summarise_if(is.numeric, mean)
Without dplyr, you could do
sapply(my_dataframe[sapply(my_dataframe, is.numeric)], mean)
I have this tibble and I want to use purrr::map2_dbl() to apply a function to each row, say addition. The following works:
add_2 <- function(a, b) a + b
sim <- tibble(
a = 1:4,
b = 5:8
)
sim %>% mutate(sum = map2_dbl(a, b, add_2))
However, when I use map2_dbl() on its own, it throws an error:
sim %>% map2_dbl(a, b, add_2)
Error in as_function(.f, ...) : object 'b' not found
I cannot understand why; also, it's interesting that pmap_dbl() works just fine.
sim %>% pmap_dbl(add_2) # 6, 8, 10, 12
Thanks for your help.
The first form works inside mutate because the variables a and b are visible within its scope. That is, mutate makes sim$a available as a, and likewise for the other columns. This is a feature of mutate, not of the %>% pipe.
The literal equivalent of sim %>% map2_dbl(a, b, add_2) is map2_dbl(sim, a, b, add_2), which I hope now clears up why that form does not work. The first two arguments of map2_dbl need to be two vectors, as @akrun's comment points out.
Perhaps you would prefer:
with(sim, map2_dbl(a, b, add_2))
# [1] 6 8 10 12
or
within(sim, { mysum = map2_dbl(a, b, add_2) })
# # A tibble: 4 × 4
# a b sum mysum
# <int> <int> <dbl> <dbl>
# 1 1 5 6 6
# 2 2 6 8 8
# 3 3 7 10 10
# 4 4 8 12 12
Neither with nor within is a tidyverse function, though they do provide some easier-to-read code at times.
sim %>% pmap_dbl(add_2), which is equivalent to pmap_dbl(sim, add_2), works because the first argument to pmap* functions is a list, and since sim is a tibble, it inherits data.frame which is just a list where the elements are vectors (or lists) of equal lengths. Your call is equivalent to pmap_dbl(list(sim$a, sim$b), add_2) (verbose and overkill, sure).
Because of the piping
sim %>%
map2_dbl(a, b, add_2)
translates to:
map2_dbl(.x = sim, .y = a, .f = b, ... = add_2)
which does not make sense, and indeed the error reports that b does not exist.
Similarly
sim %>%
pmap_dbl(add_2)
translates to:
pmap_dbl(.l = sim, .f = add_2)
which does work.
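Outside of mutate, the columns have to be passed explicitly. A minimal sketch, repeating the definitions from the question so that it runs on its own (direct and via_pmap are illustrative names):

```r
library(purrr)
library(tibble)

add_2 <- function(a, b) a + b
sim <- tibble(a = 1:4, b = 5:8)

# map2_dbl() needs the two vectors themselves as .x and .y
direct <- map2_dbl(sim$a, sim$b, add_2)

# pmap_dbl() takes the whole tibble as a list of columns and
# matches the columns a and b to add_2's arguments by name
via_pmap <- pmap_dbl(sim, add_2)

direct
# [1]  6  8 10 12
```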
I use ddply to summarize some data.frame by various categories, like this:
# with both group and size being factors / categorical
split.df <- ddply(mydata,.(group,size),summarize,
sumGroupSize = sum(someValue))
This works smoothly, but often I like to calculate ratios which implies that I need to divide by the group's total. How can I calculate such a total within the same ddply call?
Let's say I'd like to have the share of observations in group A that are in size class 1. Obviously I have to calculate the sum of all observations in size class 1 first.
Sure, I could do this with two ddply calls, but using a single call would be more comfortable. Is there a way to do so?
EDIT:
I did not mean to ask an overly specific question, but I realize I was confusing people here. So here's my specific problem. In fact I do have an example that works, but I don't consider it really nifty. Plus it has a shortcoming that I need to overcome: it does not work correctly with apply.
library(plyr)
# make the dataset more "realistic"
mydata <- warpbreaks
names(mydata) <- c("someValue","group","size")
mydata$category <- c(1,2,3)
mydata$categoryA <- c("A","A","X","X","Z","Z")
# add some NA
mydata$category[c(8,10,19)] <- NA
mydata$categoryA[c(14,1,20)] <- NA
# someValue is summarized !
# note we have another, varying category, hence we need the parameter a
calcShares <- function(a, data) {
# !is.na needs to be specific!
tempres1 <- eval(substitute(ddply(data[!is.na(a),],.(group,size,a),summarize,
sumTest = sum(someValue,na.rm=T))),
envir=data, enclos=parent.frame())
tempres2 <- eval(substitute(ddply(data[!is.na(a),],.(group,size),summarize,
sumTestTotal = sum(someValue,na.rm=T))),
envir=data, enclos=parent.frame())
res <- merge(tempres1,tempres2,by=c("group","size"))
res$share <- res$sumTest/res$sumTestTotal
return(res)
}
test <- calcShares(category,mydata)
test2 <- calcShares(categoryA,mydata)
head(test)
head(test2)
As you can see, I intend to run this over different categorical variables. In the example I have only two (category, categoryA), but in fact I have more, so using apply with my function would be really nice, but somehow it does not work correctly.
applytest <- head(apply(mydata[grep("^cat",
names(mydata),value=T)],2,calcShares,data=mydata))
... returns a warning message and a strange name (newX[, i]) for the category var.
So how can I a) do this more elegantly and b) fix the apply issue?
This seems simple, so I may be missing some aspect of your question.
First, define a function that calculates the values you want inside each level of group. Then, instead of using .(group, size) to split the data.frame, use .(group), and apply the newly defined function to each of the split pieces.
library(plyr)
# Create a dataset with the names in your example
mydata <- warpbreaks
names(mydata) <- c("someValue", "group", "size")
# A function that calculates the proportional contribution of each size class
# to the sum of someValue within a level of group
getProps <- function(df) {
with(df, ave(someValue, size, FUN=sum)/sum(someValue))
}
# The call to ddply()
res <- ddply(mydata, .(group),
.fun = function(X) transform(X, PROPS=getProps(X)))
head(res, 12)
# someValue group size PROPS
# 1 26 A L 0.4785203
# 2 30 A L 0.4785203
# 3 54 A L 0.4785203
# 4 25 A L 0.4785203
# 5 70 A L 0.4785203
# 6 52 A L 0.4785203
# 7 51 A L 0.4785203
# 8 26 A L 0.4785203
# 9 67 A L 0.4785203
# 10 18 A M 0.2577566
# 11 21 A M 0.2577566
# 12 29 A M 0.2577566
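If one row per group/size combination is wanted rather than one row per observation, the same share can be sketched in base R with aggregate() and ave(). This swaps plyr for base functions and is only one possible layout; sums is an illustrative name.

```r
# Same renamed dataset as above
mydata <- warpbreaks
names(mydata) <- c("someValue", "group", "size")

# Sum of someValue for every group/size combination
sums <- aggregate(someValue ~ group + size, data = mydata, FUN = sum)

# ave() returns the group total aligned with each row, so the
# division gives each size class's share within its group
sums$share <- sums$someValue / ave(sums$someValue, sums$group, FUN = sum)

sums[order(sums$group, sums$size), ]
```

The share for group A and size L should match the PROPS value shown above (0.4785203), and the shares within each group add up to 1.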