Is there any exception handling mechanism in dplyr's mutate()? What I mean is a way to catch exceptions and handle them.
Let us suppose that I have a function that throws an error in some cases (in this example, if the input is not positive). For the sake of simplicity I define the function myself, but in real life it would be a function from some R package. Let us suppose this function is vectorized:
# function throwing an error
my_func <- function(x){
if(x > 0) return(sqrt(x))
stop('x must be positive')
}
my_func_vect <- Vectorize(my_func)
Now, let's suppose I want to use this function inside mutate().
If this function is used inside a mutate(), it stops at the first error and no result is returned:
library(dplyr)
# dummy data
data <- data.frame(x = c(1, -1, 4, 9))
data %>% mutate(y = my_func_vect(x))
# Error in mutate_impl(.data, dots) : Evaluation error: x must be positive.
Is there a way to catch the error, and do something (e.g. return an NA) in this case, while getting results for the other elements?
The result I expect is what would be achieved using a loop with tryCatch(), i.e. something along the lines of:
y <- rep(NA_real_, length(data$x))
for(i in seq_along(data$x)) {
tryCatch({
y[i] <- my_func_vect(data$x[i])
}, error = function(err){})
}
y
# Result is: 1 NA 2 4
We can also make use of purrr's safely() or possibly() functions.
From the purrr help:
safely: wrapped function instead returns a list with components result and error. One value is always NULL.
quietly: wrapped function instead returns a list with components result, output, messages and warnings.
possibly: wrapped function uses a default value (otherwise) whenever an error occurs.
It doesn't change the fact that you have to apply the function to each row separately.
library(dplyr)
library(purrr)
# function throwing an error
my_func <- function(x){
if(x > 0) return(sqrt(x))
stop('x must be positive')
}
my_func_vect <- Vectorize(my_func)
# dummy data
data <- data.frame(x = c(1, -1, 4, 9))
With map:
data %>%
mutate(y = map_dbl(x, ~possibly(my_func_vect, otherwise = NA_real_)(.x)))
#> x y
#> 1 1 1
#> 2 -1 NA
#> 3 4 2
#> 4 9 3
Using rowwise():
data %>%
rowwise() %>%
mutate(y = possibly(my_func_vect, otherwise = NA_real_)(x))
#> Source: local data frame [4 x 2]
#> Groups: <by row>
#>
#> # A tibble: 4 x 2
#> x y
#> <dbl> <dbl>
#> 1 1 1
#> 2 -1 NA
#> 3 4 2
#> 4 9 3
The other functions are somewhat more difficult to use in a 'data-frame environment', as they are better suited to working with lists and return lists themselves.
Created on 2018-05-15 by the reprex package (v0.2.0).
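For completeness, here is a minimal sketch of how safely() could be used in a data frame if you also want to keep the error message rather than only an NA. The helper name safe_func and the column names out and err are illustrative choices, not part of the original answer, and the non-vectorized my_func is mapped element by element:

```r
library(dplyr)
library(purrr)

# same example function as above (not vectorized)
my_func <- function(x) {
  if (x > 0) return(sqrt(x))
  stop("x must be positive")
}

data <- data.frame(x = c(1, -1, 4, 9))

# safely() returns list(result = ..., error = ...) for every call;
# `otherwise` is used as the result whenever an error occurs
safe_func <- safely(my_func, otherwise = NA_real_)

res <- data %>%
  mutate(
    out = map(x, safe_func),        # list column of result/error pairs
    y   = map_dbl(out, "result"),   # pull out the numeric results
    err = map_chr(out, ~ if (is.null(.x$error)) NA_character_
                         else conditionMessage(.x$error))
  ) %>%
  select(-out)

res
```

The intermediate list column is dropped at the end, so what remains is one numeric column and one character column with the message for the rows that failed.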
If you want to handle every error individually, maybe you shouldn't use the vectorized function. Instead, use map from the purrr package, which is effectively the same as lapply here.
If you want NA values whenever an error occurs, write a wrapper function that catches the error.
try_my_func <- function(x) {
tryCatch(my_func(x), error = function(err){NA})
}
Then use mutate with map
data %>% mutate(y = purrr::map(x, try_my_func))
x y
1 1 1
2 -1 NA
3 4 2
4 9 3
Or similarly, if you don't want to declare a new function.
data %>% mutate(y = purrr::map(x, ~ tryCatch(my_func(.), error = function(err){NA})))
And lastly, if you do want to use a vectorized function, you can skip the map function altogether. Personally I never use Vectorize, so I'd do it with map.
data %>% mutate(y = Vectorize(try_my_func)(x))
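Note that purrr::map() returns a list, so y in the map-based versions above is a list column. If a plain numeric column is wanted, one possible variation (a sketch, assuming the wrapper returns NA_real_ instead of NA) is to use map_dbl():

```r
library(dplyr)
library(purrr)

my_func <- function(x) {
  if (x > 0) return(sqrt(x))
  stop("x must be positive")
}

# wrapper that returns a typed missing value on error
try_my_func <- function(x) {
  tryCatch(my_func(x), error = function(err) NA_real_)
}

data <- data.frame(x = c(1, -1, 4, 9))

# map_dbl() insists on one double per element, so y is an ordinary numeric column
res_dbl <- data %>% mutate(y = map_dbl(x, try_my_func))
res_dbl
```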
I am working on a function to perform PCA on a dataset, and I wanted to write a function to do the same stuff on different columns. However, I'm having a hard time doing so because I can't seem to make the function understand that I'm passing through columns. As an example:
perform_pca <- function(columns_to_exclude = c()) {
pca <- data %>%
select(-columns_to_exclude) %>%
other_stuff() %>%
prcomp()
pvar_pve <- tibble(
p.var = pca$sdev ^ 2 / sum(pca$sdev ^ 2),
pve = cumsum(p.var),
row_id = seq(1, length(pca) - length(columns_to_exclude))
)
ggplot(pvar_pve, ...other things)
}
However, doing afterwards
perform_pca(c(data$column1, data$column2, whatever_else))
only works if I call it without arguments. If I pass it one or more columns, it gives me an error message about the tibble length.
Put another way, what is the correct way of passing tibble columns into functions so that dplyr recognizes them as such? For example
test <- function(columns) {
data %>%
select(columns)
}
test(c(var1,var2))
would return an error. What's the correct way to actually do this?
You can do it without curly brackets just by using ... to pass to select and passing column names separately:
library(tidyverse)
data <- tibble(
a = 1:10,
b = rnorm(10),
c = letters[1:10],
d = 21:30
)
test <- function(data, ...) {
data %>%
select(-c(...))
}
test(data, a, b)
#> # A tibble: 10 × 2
#> c d
#> <chr> <int>
#> 1 a 21
#> 2 b 22
#> 3 c 23
#> 4 d 24
#> 5 e 25
#> 6 f 26
#> 7 g 27
#> 8 h 28
#> 9 i 29
#> 10 j 30
See here for info on this and other ways of doing things with tidy evaluation. The benefits of doing it this way and also using data as your first argument is that you can pipe your dataframe into the function and it will use 'tidyselect' to suggest variables to include as arguments to the function from inside your dataframe environment.
You can do it with passing a vector of columns, which is where curly brackets are needed:
test <- function(data, vars) {
data %>%
select(-c({{vars}}))
}
test(data, c(a, b))
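If the column names arrive as strings (for example from a config or from colnames()), a third option is tidyselect's all_of(). This is a sketch; the function name drop_cols and the example data are made up for illustration:

```r
library(dplyr)

data <- tibble(
  a = 1:10,
  b = 11:20,
  c = letters[1:10],
  d = 21:30
)

# `cols` is a character vector of column names to drop
drop_cols <- function(data, cols) {
  data %>%
    select(-all_of(cols))
}

res_cols <- drop_cols(data, c("a", "b"))
names(res_cols)
```

all_of() raises an error if any of the named columns is missing, which is usually what you want inside a function.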
I am changing the values of a column of a data frame and then saving the file, but the saved file does not contain the changes. What am I missing? Thanks.
test <- data.frame(name_s = c("x","y","z"), number_s = c(1,2,3))
lapply(1:length(test$number_s), function(x) {
test$number_s[x] <- test$number_s[[x]] + 1
})
write.csv(test,paste0("test ",format(Sys.time(),"%Y%m%d"),".csv"),
row.names = F)
That was oversimplified; the real case is this one:
date_format_1 = "[0-9]-[:alpha:][:alpha:][:alpha:]"
date_format_2 = "[:alpha:][:alpha:][:alpha:]-[0-9][0-9]"
test <- data.frame(name_s = c("v","w","x","y","z"), event_text = c("Aug-89","7-May","9-Jun","4-Dec-2021","Feb-99"))
lapply(1:length(test$event_text), function(x) {
if (str_detect(test$event_text[[x]], paste0("\\b",date_format_1,"\\b")) == T){
test$event_text[x] <- paste0(str_sub(test$event_text[[x]],1,1), "/F",
which(month.abb %in% str_sub(
test$event_text[[x]], 3,5)))
} else if(str_detect(test$event_text[[x]], paste0("\\b", date_format_2,"\\b"))
== T) {
test$event_text[[x]] = paste0(which(month.abb %in% str_sub(
test$event_text[x],1,3)),"/F",str_sub(test$event_text[[x]],-2))
} else {
test$event_text[x] <- test$event_text[[x]]
}
})
write.csv(test,paste0("test ",format(Sys.time(),"%Y%m%d"),".csv"),
row.names = F)
Below I have written two calls to lapply that fix the issue you were having. The problem stems from R's scoping rules: the assignment inside the function modifies a local copy of test, and that copy is never returned or extracted from the function, so the original data frame is unchanged. I have demonstrated this by printing the dataframe after each of the lapply() calls below.
We can fix this in two ways. The first, more correct version lets lapply work on the vector directly, adding one to each value and returning x + 1; the result is then assigned back to the column. (Note that I have skipped the curly braces, so the function returns the value of its single expression, x + 1; alternatively you could write function(x) {return(x+1)} for that argument.)
An alternative approach that will run slower but still uses the indexing method is global assignment. <<- assigns to the variable in the enclosing environment (here the global environment) rather than in the local scope of the function. (Note that this code is run sequentially, so this call adds 1 to the dataframe for the second time in the output shown below.)
test <- data.frame(name_s = c("x","y","z"), number_s = c(1,2,3))
# Original Behaviour, doesn't work due to scoping issues
lapply(1:length(test$number_s), function(x) {
test$number_s[x] <- test$number_s[[x]] + 1
})
#> [[1]]
#> [1] 2
#>
#> [[2]]
#> [1] 3
#>
#> [[3]]
#> [1] 4
print(test)
#> name_s number_s
#> 1 x 1
#> 2 y 2
#> 3 z 3
# function that is syntactically and functionally correct
# instead of modifying the vector in the function scope the function returns the
# new values; unlist() turns lapply's list back into a numeric vector, so the
# column we assign to stays numeric rather than becoming a list column
test$number_s <- unlist(lapply(test$number_s, function(x) x + 1))
print(test)
#> name_s number_s
#> 1 x 2
#> 2 y 3
#> 3 z 4
# function that is syntactically odd but functionally correct
# the function affects the values in the global scope, this works but is slower
# and is not best practice as it would be difficult to read
lapply(1:length(test$number_s), function(x) {
test$number_s[x] <<- test$number_s[[x]] + 1
})
#> [[1]]
#> [1] 3
#>
#> [[2]]
#> [1] 4
#>
#> [[3]]
#> [1] 5
print(test)
#> name_s number_s
#> 1 x 3
#> 2 y 4
#> 3 z 5
Created on 2021-07-23 by the reprex package (v2.0.0)
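Applied to the real example from the question, the first approach (return the new value and assign the result back) might look like the sketch below. It uses base R only and is not the original poster's code: convert_event is a hypothetical helper, the stringr calls are replaced by grepl(), substr() and match(), and the output file is written to a temporary path for illustration.

```r
test <- data.frame(
  name_s     = c("v", "w", "x", "y", "z"),
  event_text = c("Aug-89", "7-May", "9-Jun", "4-Dec-2021", "Feb-99"),
  stringsAsFactors = FALSE
)

# returns the converted string instead of trying to modify `test` in place
convert_event <- function(txt) {
  if (grepl("\\b[0-9]-[[:alpha:]]{3}\\b", txt, perl = TRUE)) {
    # e.g. "7-May": day, then month number
    paste0(substr(txt, 1, 1), "/F", match(substr(txt, 3, 5), month.abb))
  } else if (grepl("\\b[[:alpha:]]{3}-[0-9]{2}\\b", txt, perl = TRUE)) {
    # e.g. "Aug-89": month number, then the last two characters
    n <- nchar(txt)
    paste0(match(substr(txt, 1, 3), month.abb), "/F", substr(txt, n - 1, n))
  } else {
    txt
  }
}

# vapply() returns a character vector, which is assigned back to the column
test$event_text <- vapply(test$event_text, convert_event, character(1),
                          USE.NAMES = FALSE)

out_file <- tempfile(fileext = ".csv")
write.csv(test, out_file, row.names = FALSE)
test
```

Because the assignment to test$event_text happens outside the function, the data frame that write.csv() sees actually contains the changes.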
I'm experimenting with using functions in dataframes (tidyverse tibbles) in R and I ran into some difficulties. The following is a minimal (trivial) example of my problem.
Suppose I have a function that takes in three arguments: x and y are numbers, and f is a function. It performs f(x) + y and returns the output:
func_then_add = function(x, y, f) {
result = f(x) + y
return(result)
}
And I have some simple functions it might use as f:
squarer = function(x) {
result = x^2
return(result)
}
cuber = function(x) {
result = x^3
return(result)
}
Done on its own, func_then_add works as advertised:
> func_then_add(5, 2, squarer)
[1] 27
> func_then_add(6, 11, cuber)
[1] 227
But let's say I have a dataframe (tidyverse tibble) with two columns for the numeric arguments, and one column for which function I want:
library(tidyverse)
library(magrittr)
test_frame = tribble(
~arg_1, ~arg_2, ~func,
5, 2, squarer,
6, 11, cuber
)
> test_frame
# A tibble: 2 x 3
arg_1 arg_2 func
<dbl> <dbl> <list>
1 5 2 <fn>
2 6 11 <fn>
I then want to make another column result that is equal to func_then_add applied to those three columns. It should be 27 and 227 like before. But when I try this, I get an error:
> test_frame %>% mutate(result=func_then_add(.$arg_1, .$arg_2, .$func))
Error in f(x) : could not find function "f"
Why does this happen, and how do I get what I want properly? I confess that I'm new to "functional programming", so maybe I'm just making an obvious syntax error ...
Not the most elegant but we can do:
test_frame %>%
mutate(Res= map(seq_along(.$func), function(x)
func_then_add(.$arg_1, .$arg_2, .$func[[x]])))
EDIT: The above passes the entire arg_1 and arg_2 columns on every iteration, which isn't really what the OP wants. As suggested by @January, this is better written as:
Result <- test_frame %>%
mutate(Res= map(seq_along(.$func), function(x)
func_then_add(.$arg_1[x], .$arg_2[x], .$func[[x]])))
Result$Res
The above is still not ideal, since it returns a list. A better alternative (again as suggested by @January) is to use map_dbl, which returns a double vector instead of a list:
test_frame %>%
mutate(Res= map_dbl(seq_along(.$func), function(x)
func_then_add(.$arg_1[x], .$arg_2[x], .$func[[x]])))
# A tibble: 2 x 4
arg_1 arg_2 func Res
<dbl> <dbl> <list> <dbl>
1 5 2 <fn> 27
2 6 11 <fn> 227
This is because you should map instead of mutating. Mutate calls the function once, and supplies the whole columns as arguments.
The second problem is that test_frame$func[1] is not a function, but a list with one element. You can't have "function" columns, only list columns.
Try this:
test_frame$result <- with(test_frame,
map_dbl(1:2, ~ func_then_add(arg_1[.], arg_2[.], func[[.]])))
Result:
# A tibble: 2 x 4
arg_1 arg_2 func result
<dbl> <dbl> <list> <dbl>
1 5 2 <fn> 27
2 6 11 <fn> 227
EDIT: a simpler solution using dplyr, mutate and rowwise:
test_frame %>% rowwise %>% mutate(res=func_then_add(arg_1, arg_2, func))
Quite frankly, I am slightly puzzled by this last one. Why func and not func[[1]]? func should be a list, and not function. mutate and rowwise are doing here something sinister, like automatically converting a list to a vector.
Edit 2: actually, this is written explicitly in the rowwise manual:
Its main impact is to allow you to work with list-variables in
‘summarise()’ and ‘mutate()’ without having to use ‘[[1]]’.
Final edit: I became so fixated on tidyverse recently that I did not think of the simplest option – using base R:
apply(test_frame, 1, function(x) func_then_add(x$arg_1, x$arg_2, x$func))
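Another way to express "call the function once per row" is purrr's pmap_dbl(), which walks several columns in parallel and passes the i-th element of each as arguments. This is a sketch and not part of the answers above; because the list is unnamed, the columns are matched to the parameters x, y and f by position:

```r
library(dplyr)
library(purrr)
library(tibble)

func_then_add <- function(x, y, f) f(x) + y
squarer <- function(x) x^2
cuber   <- function(x) x^3

test_frame <- tribble(
  ~arg_1, ~arg_2, ~func,
  5,      2,      squarer,
  6,      11,     cuber
)

# pmap_dbl() extracts each function from the list column with [[ ]] for us
res_pmap <- test_frame %>%
  mutate(result = pmap_dbl(list(arg_1, arg_2, func), func_then_add))

res_pmap$result
```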
I want to remove duplicate rows from a dataframe, for specific columns only. That can be obtained with distinct:
data <- tibble(a = c(1, 1, 2, 2), b = c(3, 3, 3, 4), z = c(5,4,5,5))
filtered_data <- data %>% distinct(a, b, .keep_all = T)
dim(filtered_data)
# [1] 3 3
This is (almost) what I need. Yet, my problem is that the column names I need to use with distinct will change. So I have a character vector gen that contains the names of the columns I want to use with the distinct function. They need to be unquoted to be useful in the pipe. I found suggestions to use as.name() or eval(parse()). This, however, gives me a different result:
gen <- c("a", "b")
filtered_data <- data %>% distinct(eval(parse(text = gen)), .keep_all = T)
dim(filtered_data)
# [1] 2 4
The eval seems to do something funny with the amount of times the data is filtered. (and, adds an extra column. I could live with that, though...) So, how to obtain a similar result, as if I had used a,b, but by using a variable instead?
additional information
I actually obtain gen by reading the column names of a dataframe: gen <- colnames(data)[1:2]. The solution suggested by @gymbrane would be perfect if I had a way to transform gen to c(a, b). The whole point is to avoid hardcoding the column names. I tried things like gen <- noquotes(gen), which does not give an error in the rm_dup_rows function suggested below, but it does give a different result, with the same sort of repeated filtering as I started with...
fixed
I think I got it working. It might be inelegant, and I'm not sure if every step is necessary for the result, but it seems to work by combining the function provided by @gymbrane below with ensym and quos in a for loop while adding to a list in GlobalEnv (edit: GlobalEnv isn't necessary):
unquote_string <- function(string) {
out <- list()
i <- 1
for (s in string) {
t <- ensym(s)
out[i] <-dplyr::quos(!!t)
i <- i+1
}
return(out)
}
gen_quo <- unquote_string(gen)
filtered_data <- rm_dup_rows(data, gen_quo)
dim(filtered_data)
# [1] 3 3
How about creating a function and using quosures? Perhaps something like this is what you are looking for...
rm_dup_rows <- function(data, ...){
vars = dplyr::quos(...)
data %>% distinct(!!! vars, .keep_all = T)
}
I believe this returns what you are asking for
rm_dup_rows(data = data, a, b)
# A tibble: 3 x 3
a b z
<dbl> <dbl> <dbl>
1 3 5
2 3 5
2 4 5
rm_dup_rows(data, b, z)
# A tibble: 3 x 3
a b z
<dbl> <dbl> <dbl>
1 3 5
1 3 4
2 4 5
Additional
You could modify rm_dup_rows just slightly and construct your vector with quos. Something like this...
rm_dup_rows <- function(data, vars){
data %>% distinct(!!! vars, .keep_all = T)
}
# quos your column name vector
gen <- quos(a,z)
rm_dup_rows(data, gen)
# A tibble: 3 x 3
a b z
<dbl> <dbl> <dbl>
1 3 5
1 3 4
2 3 5
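When the column names are only available as strings, as with gen <- colnames(data)[1:2] in the question, one option is to convert them to symbols with rlang::syms() and splice them in with !!!. This is a sketch, assuming rlang is installed (it is a dependency of dplyr). It also suggests why the eval(parse()) attempt gave 2 rows and 4 columns: evaluating a multi-element expression returns only the last value, so distinct() saw just column b and stored it under a new, unnamed-looking column.

```r
library(dplyr)

data <- tibble(a = c(1, 1, 2, 2), b = c(3, 3, 3, 4), z = c(5, 4, 5, 5))

# column names as strings, e.g. read from another data frame
gen <- colnames(data)[1:2]

# syms() turns c("a", "b") into a list of symbols; !!! splices them in
# as if `a, b` had been typed directly
filtered_syms <- data %>%
  distinct(!!!rlang::syms(gen), .keep_all = TRUE)

dim(filtered_syms)
```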
I would like to use dplyr's mutate_if() function to convert list-columns to data-frame-columns, but run into a puzzling error when I try to do so. I am using dplyr 0.5.0, purrr 0.2.2, R 3.3.0.
The basic setup looks like this: I have a data frame d, some of whose columns are lists:
d <- dplyr::data_frame(
A = list(
list(list(x = "a", y = 1), list(x = "b", y = 2)),
list(list(x = "c", y = 3), list(x = "d", y = 4))
),
B = LETTERS[1:2]
)
I would like to convert the column of lists (in this case, d$A) to a column of data frames using the following function:
tblfy <- function(x) {
x %>%
purrr::transpose() %>%
purrr::simplify_all() %>%
dplyr::as_data_frame()
}
That is, I would like the list-column d$A to be replaced by the list lapply(d$A, tblfy), which is
[[1]]
# A tibble: 2 x 2
x y
<chr> <dbl>
1 a 1
2 b 2
[[2]]
# A tibble: 2 x 2
x y
<chr> <dbl>
1 c 3
2 d 4
Of course, in this simple case, I could just do a simple reassignment. The point, however, is that I would like to do this programmatically, ideally with dplyr, in a generally applicable way that could deal with any number of list-columns.
Here's where I stumble: When I try to convert the list-columns to data-frame-columns using the following application
d %>% dplyr::mutate_if(is.list, funs(tblfy))
I get an error message that I don't know how to interpret:
Error: Each variable must be named.
Problem variables: 1, 2
Why does mutate_if() fail? How can I properly apply it to get the desired result?
Remark
A commenter has pointed out that the function tblfy() should be vectorized. That is a reasonable suggestion. But — unless I have vectorized incorrectly — that does not seem to get at the root of the problem. Plugging in a vectorized version of tblfy(),
tblfy_vec <- Vectorize(tblfy)
into mutate_if() fails with the error
Error: wrong result size (4), expected 2 or 1
Update
After gaining some experience with purrr, I now find the following approach natural, if somewhat long-winded:
d %>%
map_if(is.list, ~ map(., ~ map_df(., identity))) %>%
as_data_frame()
This is more or less identical to @alistaire's solution, below, but uses map_if() and map() in place of mutate_if() and Vectorize(), respectively.
The original tblfy function errors out for me (even when its elements are chained directly), so let's rebuild it a bit, adding vectorization as well, which lets us avoid an otherwise-necessary prior rowwise() call:
tblfy <- Vectorize(function(x){x %>% purrr::map_df(identity) %>% list()})
Now we can use mutate_if nicely:
d %>% mutate_if(purrr::is_list, tblfy)
## Source: local data frame [2 x 2]
##
## A B
## <list> <chr>
## 1 <tbl_df [2,2]> A
## 2 <tbl_df [2,2]> B
...and if we unnest to see what's there,
d %>% mutate_if(purrr::is_list, tblfy) %>% tidyr::unnest()
## Source: local data frame [4 x 3]
##
## B x y
## <chr> <chr> <dbl>
## 1 A a 1
## 2 A b 2
## 3 B c 3
## 4 B d 4
A couple notes:
map_df(identity) seems to be more efficient at building a tibble than any of the alternative formulations. I know the identity call seems unnecessary, but most everything else breaks.
I'm not sure how widely useful tblfy will be, as it's somewhat dependent on the structure of the lists in the list column, which can vary enormously. If you have a lot with a similar structure, I suppose it's useful, though.
There may be a way to do this with pmap instead of Vectorize, but I can't get it to work with some cursory tries.
In-place conversion without any copying:
library(data.table)
for (col in d) if (is.list(col)) lapply(col, setDF)
d
#Source: local data frame [2 x 2]
#
# A B
#1 <S3:data.frame> A
#2 <S3:data.frame> B
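With more recent dplyr versions (1.0 or later, which is an assumption relative to the versions named in the question), the same conversion could be sketched with across() and where() instead of mutate_if(), using bind_rows() to turn each inner list of records into a tibble:

```r
library(dplyr)
library(purrr)

d <- tibble(
  A = list(
    list(list(x = "a", y = 1), list(x = "b", y = 2)),
    list(list(x = "c", y = 3), list(x = "d", y = 4))
  ),
  B = LETTERS[1:2]
)

# for every list column, convert each element (a list of named records)
# into a tibble with one row per record
d_tbl <- d %>%
  mutate(across(where(is.list), ~ map(.x, bind_rows)))

d_tbl$A[[1]]
```

As with tblfy, this only works when the records in each list element share a compatible structure.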