dplyr summarise() and summarise_each() make extra calls to the provided functions

It seems that summarise and summarise_each are making unnecessary extra calls to the callback functions they are provided with. Suppose that we have the following
X <- data.frame( Group = rep(c("G1","G2"),2:3), Var1 = 1:5, Var2 = 11:15 )
which looks like this:
Group Var1 Var2
1 G1 1 11
2 G1 2 12
3 G2 3 13
4 G2 4 14
5 G2 5 15
Further suppose that we have a (potentially expensive) function
f <- function(v)
{
  cat( "Calling f with vector", v, "\n" )
  ## ...additional bookkeeping and processing...
  mean(v)
}
that we would like to apply to each of our variables in each group. Using dplyr, we might go about it in the following way:
X %>% group_by( Group ) %>% summarise_each( funs(f) )
However, the output shows that f was called one additional time for each variable in G1:
Calling f with vector 1 2
Calling f with vector 1 2
Calling f with vector 3 4 5
Calling f with vector 11 12
Calling f with vector 11 12
Calling f with vector 13 14 15
# A tibble: 2 x 3
Group Var1 Var2
<fctr> <dbl> <dbl>
1 G1 1.5 11.5
2 G2 4.0 14.0
The same issue is present when using summarize:
> X %>% group_by( Group ) %>% summarise( test = f(Var1) )
Calling f with vector 1 2
Calling f with vector 1 2
Calling f with vector 3 4 5
# A tibble: 2 × 2
Group test
<fctr> <dbl>
1 G1 1.5
2 G2 4.0
Why is this happening and how would one go about preventing summarise and summarise_each from making those extra calls?
(This is using R version 3.3.0 and dplyr version 0.5.0)
EDIT: It appears that the issue has to do with the interplay between group_by and summarise/summarise_each. Without the grouping, no extra calls are made. Also, mutate and mutate_each do not suffer from this issue. (Credit: eddi and eipi10 for these findings)

Although this issue is still present in dplyr 0.5.0 (published 2016-06-24), it is fixed in the dplyr GitHub repository. It was fixed with this commit made on 2016-09-24. I've confirmed that I can reproduce the issue when I check out and build the version at the previous commit, but not when building from that one or subsequent ones.
(And yes, I tried a whole bunch of other ones before I found it. Why I go to such lengths in hope of earning imaginary internet points, I leave as a question for my therapist. :)
In particular, in the function SEXP process_data(const Data& gdf) in inst/include/dplyr/Result/CallbackProcessor.h, note these changes:
CLASS* obj = static_cast<CLASS*>(this);
typename Data::group_iterator git = gdf.group_begin();
RObject first_result = obj->process_chunk(*git);
++git; // This line was added
and
for (int i = 1; i < ngroups; ++git, ++i) { // changed from starting at i = 0
RObject chunk = obj->process_chunk(*git);
[Comments added by me, not part of the actual source]
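If you are stuck on the CRAN release of dplyr 0.5.0 and the extra call is too costly, one possible workaround is to do the grouped computation in base R, where the function is applied exactly once per group. This is a minimal sketch and not part of the dplyr fix itself; the `calls` counter is an illustrative addition that only makes the number of invocations visible:

```r
X <- data.frame(Group = rep(c("G1", "G2"), 2:3), Var1 = 1:5, Var2 = 11:15)

calls <- 0  # illustrative counter, not part of the original f
f <- function(v) {
  calls <<- calls + 1
  mean(v)
}

# split() cuts Var1 into one vector per group; vapply() then calls f
# once for each of those vectors.
res <- vapply(split(X$Var1, X$Group), f, numeric(1))
res
calls
```

With two groups, `calls` ends up at 2 rather than 3.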

Related

What is the tidyverse way of passing columns into functions? [duplicate]

I am working on a function to perform PCA on a dataset, and I wanted to write a function to do the same stuff on different columns. However, I'm having a hard time doing so because I can't seem to make the function understand that I'm passing through columns. As an example:
perform_pca <- function(columns_to_exclude = c()) {
  pca <- data %>%
    select(-columns_to_exclude) %>%
    other_stuff() %>%
    prcomp()
  pvar_pve <- tibble(
    p.var = pca$sdev ^ 2 / sum(pca$sdev ^ 2),
    pve = cumsum(p.var),
    row_id = seq(1, length(pca) - length(columns_to_exclude))
  )
  ggplot(pvar_pve, ...other things)
}
However, doing afterwards
perform_pca(c(data$column1, data$column2, whatever_else))
only works if I call it without arguments. If I pass it one or more columns, it gives me an error message about the tibble length.
Put another way, what is the correct way of passing tibble columns into functions so that dplyr recognizes them as such? For example
test <- function(columns) {
data %>%
select(columns)
}
test(c(var1,var2))
would return an error. What's the correct way to actually do this?
You can do it without curly brackets just by using ... to pass to select and passing column names separately:
library(tidyverse)
data <- tibble(
a = 1:10,
b = rnorm(10),
c = letters[1:10],
d = 21:30
)
test <- function(data, ...) {
data %>%
select(-c(...))
}
test(data, a, b)
#> # A tibble: 10 × 2
#> c d
#> <chr> <int>
#> 1 a 21
#> 2 b 22
#> 3 c 23
#> 4 d 24
#> 5 e 25
#> 6 f 26
#> 7 g 27
#> 8 h 28
#> 9 i 29
#> 10 j 30
See here for info on this and other ways of doing things with tidy evaluation. The benefit of doing it this way, and of taking data as the first argument, is that you can pipe your data frame into the function, and 'tidyselect' can suggest variables from inside the data frame as arguments to the function.
You can also do it by passing a vector of columns, which is where curly brackets are needed:
test <- function(data, vars) {
data %>%
select(-c({{vars}}))
}
test(data, c(a, b))
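If the column names arrive as strings rather than bare names, tidyselect's `all_of()` covers that case. A hedged sketch, assuming dplyr 1.0 or later; `drop_cols` and `vars` are names made up for illustration:

```r
library(dplyr)

data <- tibble(a = 1:10, b = 11:20, c = letters[1:10], d = 21:30)

# `vars` is a character vector of column names to drop
drop_cols <- function(data, vars) {
  data %>%
    select(-all_of(vars))
}

kept <- drop_cols(data, c("a", "b"))
names(kept)
```

Because `all_of()` is strict, a name in `vars` that does not exist in `data` raises an error instead of being silently ignored.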

How to define a variable to record the number of processed rows when using R, dplyr and rowwise?

I have a function that takes a long time to run, so I want to know how many rows of my data frame have been processed. Usually we can define a counter variable in a for loop to handle this easily, but I do not know how to do it in dplyr.
Let's say the code is:
library(tidyverse)
myFUN <- function(x) {
  x + 1
}
a <- tibble(id=c(1:3),x=c(3,5,1))
a1 <- a %>%
rowwise() %>%
mutate(y=myFUN(x))
I would like to define a variable i somewhere in the code whose value increases by 1 every time a row is processed, and then print its value in the console, like:
1
2
3
Can you pass another variable to the function, namely the row number of the data frame, and print it inside the function? Something like:
myFUN <-function (x, y) {
message(y)
x + 1
}
and then use
library(dplyr)
a %>% mutate(y = purrr::map2_dbl(x, row_number(), myFUN))
#1
#2
#3
# A tibble: 3 x 3
# id x y
# <int> <dbl> <dbl>
#1 1 3 4
#2 2 5 6
#3 3 1 2
If your function is vectorized, you can drop map2_dbl and do
a %>% mutate(y= myFUN(x, seq_len(n())))
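Another option that leaves the signature of `myFUN` untouched is to wrap it in a closure that keeps its own counter. This is a sketch in base R rather than a dplyr feature; `with_counter` is a hypothetical helper name, and `vapply()` stands in for the row-by-row loop:

```r
myFUN <- function(x) {
  x + 1
}

# Returns a wrapped version of `fun` that prints a running count on each call.
with_counter <- function(fun) {
  i <- 0
  function(...) {
    i <<- i + 1
    message(i)
    fun(...)
  }
}

counted <- with_counter(myFUN)
y <- vapply(c(3, 5, 1), counted, numeric(1))
y
```

The wrapped function can be used wherever `myFUN` was, for example in `mutate(y = counted(x))` after `rowwise()`. The counter reflects the number of calls, so any extra evaluation on dplyr's side would show up in it.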

Mutating columns of a data frame based on a predicate function (dplyr::mutate_if)

I would like to use dplyr's mutate_if() function to convert list-columns to data-frame-columns, but run into a puzzling error when I try to do so. I am using dplyr 0.5.0, purrr 0.2.2, R 3.3.0.
The basic setup looks like this: I have a data frame d, some of whose columns are lists:
d <- dplyr::data_frame(
A = list(
list(list(x = "a", y = 1), list(x = "b", y = 2)),
list(list(x = "c", y = 3), list(x = "d", y = 4))
),
B = LETTERS[1:2]
)
I would like to convert the column of lists (in this case, d$A) to a column of data frames using the following function:
tblfy <- function(x) {
x %>%
purrr::transpose() %>%
purrr::simplify_all() %>%
dplyr::as_data_frame()
}
That is, I would like the list-column d$A to be replaced by the list lapply(d$A, tblfy), which is
[[1]]
# A tibble: 2 x 2
x y
<chr> <dbl>
1 a 1
2 b 2
[[2]]
# A tibble: 2 x 2
x y
<chr> <dbl>
1 c 3
2 d 4
Of course, in this simple case, I could just do a simple reassignment. The point, however, is that I would like to do this programmatically, ideally with dplyr, in a generally applicable way that could deal with any number of list-columns.
Here's where I stumble: When I try to convert the list-columns to data-frame-columns using the following application
d %>% dplyr::mutate_if(is.list, funs(tblfy))
I get an error message that I don't know how to interpret:
Error: Each variable must be named.
Problem variables: 1, 2
Why does mutate_if() fail? How can I properly apply it to get the desired result?
Remark
A commenter has pointed out that the function tblfy() should be vectorized. That is a reasonable suggestion. But — unless I have vectorized incorrectly — that does not seem to get at the root of the problem. Plugging in a vectorized version of tblfy(),
tblfy_vec <- Vectorize(tblfy)
into mutate_if() fails with the error
Error: wrong result size (4), expected 2 or 1
Update
After gaining some experience with purrr, I now find the following approach natural, if somewhat long-winded:
d %>%
map_if(is.list, ~ map(., ~ map_df(., identity))) %>%
as_data_frame()
This is more or less identical to @alistaire's solution, below, but uses map_if(), resp. map(), in place of mutate_if(), resp. Vectorize().
The original tblfy function errors out for me (even when its elements are chained directly), so let's rebuild it a bit, adding vectorization as well, which lets us avoid an otherwise-necessary prior rowwise() call:
tblfy <- Vectorize(function(x){x %>% purrr::map_df(identity) %>% list()})
Now we can use mutate_if nicely:
d %>% mutate_if(purrr::is_list, tblfy)
## Source: local data frame [2 x 2]
##
## A B
## <list> <chr>
## 1 <tbl_df [2,2]> A
## 2 <tbl_df [2,2]> B
...and if we unnest to see what's there,
d %>% mutate_if(purrr::is_list, tblfy) %>% tidyr::unnest()
## Source: local data frame [4 x 3]
##
## B x y
## <chr> <chr> <dbl>
## 1 A a 1
## 2 A b 2
## 3 B c 3
## 4 B d 4
A couple notes:
map_df(identity) seems to be more efficient at building a tibble than any of the alternative formulations. I know the identity call seems unnecessary, but most everything else breaks.
I'm not sure how widely useful tblfy will be, as it's somewhat dependent on the structure of the lists in the list column, which can vary enormously. If you have a lot with a similar structure, I suppose it's useful, though.
There may be a way to do this with pmap instead of Vectorize, but I can't get it to work with some cursory tries.
In-place conversion without any copying:
library(data.table)
for (col in d) if (is.list(col)) lapply(col, setDF)
d
#Source: local data frame [2 x 2]
#
# A B
#1 <S3:data.frame> A
#2 <S3:data.frame> B
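For comparison, the same conversion can be written without dplyr or purrr by looping over the columns and applying the converter to each element with `lapply()`, which is the per-element step that the answers above add through `Vectorize()` or `map()`. A minimal sketch in base R; `to_df` is a stand-in for `tblfy`, and it assumes every inner list has the same fields:

```r
d <- data.frame(B = LETTERS[1:2], stringsAsFactors = FALSE)
d$A <- list(
  list(list(x = "a", y = 1), list(x = "b", y = 2)),
  list(list(x = "c", y = 3), list(x = "d", y = 4))
)

# Turn one list of records into a data frame, one row per record.
to_df <- function(records) {
  rows <- lapply(records, as.data.frame, stringsAsFactors = FALSE)
  do.call(rbind, rows)
}

# Convert every list-column element by element; other columns are untouched.
for (nm in names(d)) {
  if (is.list(d[[nm]])) d[[nm]] <- lapply(d[[nm]], to_df)
}

d$A[[1]]
```

This yields plain data frames rather than tibbles, which may or may not matter for downstream code.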

Remove last N rows in a data frame with an arbitrary number of rows

I have a data frame and I want to remove the last N rows from it.
If I want to remove 5 rows, I currently use the following command, which in my opinion is rather convoluted:
df<- df[-seq(nrow(df),nrow(df)-4),]
How would you accomplish this task? Is there a convenient function that I can use in R?
In unix, I would use:
tac file | sed '1,5d' | tac
head with a negative index is convenient for this...
df <- data.frame( a = 1:10 )
head(df,-5)
# a
#1 1
#2 2
#3 3
#4 4
#5 5
p.s. your seq() example may be written slightly less(?) awkwardly using the named arguments by and length.out (shortened to len) like this -seq(nrow(df),by=-1,len=5).
This one takes one more line, but is far more readable:
n<-dim(df)[1]
df<-df[1:(n-5),]
Of course, you can do it in one line by sticking the dim command directly into the re-assignment statement.
I assume this is part of a reproducible script, and you can retrace your steps... Otherwise, strongly recommend in such cases to save to a different variable (e.g., df2) and then remove the redundant copy only after you're sure you got what you wanted.
Adding a dplyr answer for completeness:
test_df <- data_frame(a = c(1,2,3,4,5,6,7,8,9,10),
b = c("a","b","c","d","e","f","g","h","i","j"))
slice(test_df, 1:(n()-5))
## A tibble: 5 x 2
# a b
# <dbl> <chr>
#1 1 a
#2 2 b
#3 3 c
#4 4 d
#5 5 e
Another dplyr answer which is even more readable:
df %>% filter(row_number() <= n()-5)
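One caveat worth checking before picking among these: the `df[1:(n-5),]` form behaves differently from `head()` when N is as large as the number of rows. A small base R sketch of the edge case, using a throwaway five-row data frame:

```r
df <- data.frame(a = 1:5)
n <- 5  # drop as many rows as the data frame has

# head() with a negative n returns an empty data frame, as expected.
nrow(head(df, -n))

# 1:(nrow(df) - n) is 1:0, i.e. c(1, 0), so the first row survives.
nrow(df[1:(nrow(df) - n), , drop = FALSE])

# seq_len() produces an empty index instead, which gives zero rows.
nrow(df[seq_len(max(nrow(df) - n, 0)), , drop = FALSE])
```

If N can reach or exceed the row count, `head(df, -N)` or the `seq_len()` variant is the safer choice.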
