dplyr summarise evaluates custom function twice? [duplicate]

It seems that summarise and summarise_each are making unnecessary extra calls to the callback functions they are provided with. Suppose that we have the following:
X <- data.frame( Group = rep(c("G1","G2"),2:3), Var1 = 1:5, Var2 = 11:15 )
which looks like this:
Group Var1 Var2
1 G1 1 11
2 G1 2 12
3 G2 3 13
4 G2 4 14
5 G2 5 15
Further suppose that we have a (potentially expensive) function
f <- function(v)
{
  cat( "Calling f with vector", v, "\n" )
  ## ...additional bookkeeping and processing...
  mean(v)
}
that we would like to apply to each of our variables in each group. Using dplyr, we might go about it in the following way:
X %>% group_by( Group ) %>% summarise_each( funs(f) )
However, the output shows that f was called one additional time for each variable in G1:
Calling f with vector 1 2
Calling f with vector 1 2
Calling f with vector 3 4 5
Calling f with vector 11 12
Calling f with vector 11 12
Calling f with vector 13 14 15
# A tibble: 2 x 3
Group Var1 Var2
<fctr> <dbl> <dbl>
1 G1 1.5 11.5
2 G2 4.0 14.0
The same issue is present when using summarize:
> X %>% group_by( Group ) %>% summarise( test = f(Var1) )
Calling f with vector 1 2
Calling f with vector 1 2
Calling f with vector 3 4 5
# A tibble: 2 × 2
Group test
<fctr> <dbl>
1 G1 1.5
2 G2 4.0
Why is this happening and how would one go about preventing summarise and summarise_each from making those extra calls?
(This is using R version 3.3.0 and dplyr version 0.5.0)
EDIT: It appears that the issue has to do with the interplay between group_by and summarise/summarise_each. Without the grouping, no extra calls are made. Also, mutate and mutate_each do not suffer from this issue. (Credit: eddi and eipi10 for these findings)
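A quick way to see the grouping dependence for yourself: drop the group_by and count the calls. Per the findings above, the ungrouped version on 0.5.0 calls f exactly once:
X %>% summarise( test = f(Var1) )
# Calling f with vector 1 2 3 4 5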

Although this issue is still present in dplyr 0.5.0 (published 2016-06-24), it is fixed in the dplyr GitHub repo. It was fixed by this commit made on 2016-09-24. I've confirmed that I can reproduce the issue when I check out and build the version at the previous commit, but not when building from that commit or subsequent ones.
(And yes, I tried a whole bunch of other ones before I found it. Why I go to such lengths in hope of earning imaginary internet points, I leave as a question for my therapist. :)
In particular, in the function SEXP process_data(const Data& gdf) in inst/include/dplyr/Result/CallbackProcessor.h, note these changes:
CLASS* obj = static_cast<CLASS*>(this);
typename Data::group_iterator git = gdf.group_begin();
RObject first_result = obj->process_chunk(*git);
++git; // This line was added
and
for (int i = 1; i < ngroups; ++git, ++i) { // changed from starting at i = 0
  RObject chunk = obj->process_chunk(*git);
[Comments added by me, not part of the actual source]
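If you're stuck on 0.5.0 in the meantime, one possible workaround (a sketch, assuming f is merely expensive and has no side effects that must run exactly once per group) is to memoise f, so the duplicated first-group call returns a cached result instead of re-running the body:
library(memoise)
f_cached <- memoise(f)  # repeated calls with identical input return the cached value
X %>% group_by( Group ) %>% summarise_each( funs(f_cached) )
Alternatively, installing the development version that contains the fix (e.g. devtools::install_github("hadley/dplyr"), the repository name at the time) avoids the extra calls altogether.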

Related

What is the tidyverse way of passing columns into functions? [duplicate]

I am working on a function to perform PCA on a dataset, and I wanted to write it so I can do the same operations on different columns. However, I'm having a hard time because I can't seem to make the function understand that I'm passing in columns. As an example:
perform_pca <- function(columns_to_exclude = c()) {
  pca <- data %>%
    select(-columns_to_exclude) %>%
    other_stuff() %>%
    prcomp()
  pvar_pve <- tibble(
    p.var = pca$sdev ^ 2 / sum(pca$sdev ^ 2),
    pve = cumsum(p.var),
    row_id = seq(1, length(pca) - length(columns_to_exclude))
  )
  ggplot(pvar_pve, ...other things)
}
However, afterwards calling
perform_pca(c(data$column1, data$column2, whatever_else))
only works if I call it without arguments. If I pass it one or more columns, it gives me an error message about the tibble length.
Put another way, what is the correct way of passing tibble columns into functions so that dplyr recognizes them as such? For example:
test <- function(columns) {
  data %>%
    select(columns)
}
test(c(var1,var2))
would return an error. What's the correct way to actually do this?
You can do it without curly brackets by using ... to pass the column names separately to select:
library(tidyverse)
data <- tibble(
  a = 1:10,
  b = rnorm(10),
  c = letters[1:10],
  d = 21:30
)
test <- function(data, ...) {
  data %>%
    select(-c(...))
}
test(data, a, b)
#> # A tibble: 10 × 2
#> c d
#> <chr> <int>
#> 1 a 21
#> 2 b 22
#> 3 c 23
#> 4 d 24
#> 5 e 25
#> 6 f 26
#> 7 g 27
#> 8 h 28
#> 9 i 29
#> 10 j 30
See here for info on this and other ways of doing things with tidy evaluation. The benefit of doing it this way, and of using data as your first argument, is that you can pipe your dataframe into the function, and 'tidyselect' will suggest variables from inside your dataframe environment as arguments to the function.
You can also pass a vector of columns, which is where curly brackets are needed:
test <- function(data, vars) {
  data %>%
    select(-c({{ vars }}))
}
test(data, c(a, b))
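If the columns arrive as a character vector of names rather than as bare names, a small variation using tidyselect's all_of() handles that case. A sketch under that assumption (test_chr is just an illustrative name):
test_chr <- function(data, vars) {
  data %>%
    select(-all_of(vars))
}
test_chr(data, c("a", "b"))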

How to define a variable to record the number of processed rows when using R, dplyr and rowwise?

I have a function which takes a long time to run, so I want to know how many rows of my data frame have been processed so far. Usually we can define a counter variable in a for loop to handle this easily, but I do not know how to do it in dplyr.
Let's say the code is:
library(tidyverse)
myFUN <- function(x) {
  x + 1
}
a <- tibble(id = c(1:3), x = c(3, 5, 1))
a1 <- a %>%
  rowwise() %>%
  mutate(y = myFUN(x))
I would like to define a counter variable i somewhere in the code whose value increases by 1 every time a row is processed, and have its values printed in the console like:
1
2
3
You can pass another variable to the function, holding the row number of the dataframe, and print it inside the function. Something like:
myFUN <- function(x, y) {
  message(y)
  x + 1
}
and then use
library(dplyr)
a %>% mutate(y = purrr::map2_dbl(x, row_number(), myFUN))
#1
#2
#3
# A tibble: 3 x 3
# id x y
# <int> <dbl> <dbl>
#1 1 3 4
#2 2 5 6
#3 3 1 2
If your function is vectorized, you can drop map_dbl and do
a %>% mutate(y = myFUN(x, seq_len(n())))
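If you want a counter that doesn't rely on row_number(), a closure holding its own state also works. A minimal sketch (make_counter is an illustrative helper, not part of dplyr):
make_counter <- function() {
  i <- 0
  function() {
    i <<- i + 1  # increment the enclosed counter
    message(i)   # report progress on the console
    invisible(i)
  }
}
tick <- make_counter()
a1 <- a %>%
  rowwise() %>%
  mutate(y = { tick(); myFUN(x) })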

R: change one value every row in big dataframe

I just started working with R for my master's thesis, and up to now all my calculations have worked out, as I read a lot of questions and answers here (and it's a lot of trial and error, but that's OK).
Now I need to write a more sophisticated piece of code and I can't find a way to do it.
This is the situation: I have multiple sub-datasets with a lot of entries, but they are all structured in the same way. In one of them (50,000 entries) I want to change only one value in every row. The new value should be the amount of the existing entry plus a few values from another sub-dataset (140,000 entries) where the 'ID' variable is the same.
As this is the third day I've been trying to solve this, I have already found and tested for and apply, but both ran for hours (cancelled after three hours).
Here is an example of one of my attempts (with for):
for (i in 1:50000) {
  Entry_ID <- Sub02[i, 4]
  SUM_Entries <- sum(Sub03$Source == Entry_ID)
  Entries_w_ID <- subset(Sub03, grepl(Entry_ID, Sub03$Source)) # The Entry_ID/Source is a character
  Value1 <- as.numeric(Entries_w_ID$VAL1)
  SUM_Value1 <- sum(Value1)
  Value2 <- as.numeric(Entries_w_ID$VAL2)
  SUM_Value2 <- sum(Value2)
  OLD_Val1 <- Sub02[i, 13]
  OLD_Val <- as.numeric(OLD_Val1)
  NEW_Val <- SUM_Entries + SUM_Value1 + SUM_Value2 + OLD_Val
  Sub02[i, 13] <- NEW_Val
}
I know this might be a silly piece of code, but that's the way I tried it as a beginner. I would be very grateful if someone could help me out with this so I can get on with my thesis.
Thank you!
Here's an example of my data-structure:
Text VAL0 Source ID VAL1 VAL2 VAL3 VAL4 VAL5 VAL6 VAL7 VAL8 VAL9
XXX 12 456335667806925_1075080942599058 10153901516433434_10153902087098434 4 1 0 0 4 9 4 6 8
ABC 8 456335667806925_1057045047735981 10153677787178434_10153677793613434 6 7 1 1 5 3 6 8 11
DEF 8 456747267806925_2357045047735981 45653677787178434_94153677793613434 5 8 2 1 5 4 1 1 9
The output I expect is an updated value 'VAL9' in every row.
From what I understood so far, you need two things:
sum up some values in one dataset
add them to another dataset, using an ID variable
Besides what #yoland already contributed, I would suggest breaking it down into two separate tasks. Consider these two datasets:
a = data.frame(x = 1:2, id = letters[1:2], stringsAsFactors = FALSE)
a
# x id
# 1 1 a
# 2 2 b
b = data.frame(values = as.character(1:4), otherid = letters[1:2],
               stringsAsFactors = FALSE)
sapply(b, class)
# values otherid
# "character" "character"
values is character at the moment, so we need to convert it to numeric:
b$values = as.numeric(b$values)
sapply(b, class)
# values otherid
# "numeric" "character"
Then sum up the values in b (grouped by otherid):
library(dplyr)
b = group_by(b, otherid)
b = summarise(b, sum_values = sum(values))
b
# otherid sum_values
# <chr> <dbl>
# 1 a 4
# 2 b 6
Then join it with a - note that identifiers are specified in c():
ab = left_join(a, b, by = c("id" = "otherid"))
ab
# x id sum_values
# 1 1 a 4
# 2 2 b 6
We can then add the result of the sum from b to the variable x in a:
ab$total = ab$x + ab$sum_values
ab
# x id sum_values total
# 1 1 a 4 5
# 2 2 b 6 8
From what I understand, you want to create a new variable that uses information from two different data sets indexed by the same ID. The easiest way to do this is probably to join the data sets together (if you need to save memory, join only the columns you need). I found dplyr's join functions very handy for these cases (explained neatly here). Once you have joined the data sets into one, it should be easy to create the new columns you need, e.g.: df$new <- df$old1 + df$old2
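Putting the two answers together for the original data, here is a sketch of the join-based approach, assuming the column names from the example above, exact ID matches (the loop mixes == and grepl), and that the VAL columns may be stored as character:
library(dplyr)
sums <- Sub03 %>%
  group_by(Source) %>%
  summarise(n_entries = n(),                    # SUM_Entries in the loop
            sum_val1  = sum(as.numeric(VAL1)),  # SUM_Value1
            sum_val2  = sum(as.numeric(VAL2)))  # SUM_Value2
Sub02 <- Sub02 %>%
  left_join(sums, by = c("ID" = "Source")) %>%
  mutate(VAL9 = as.numeric(VAL9) + coalesce(n_entries, 0L) +
           coalesce(sum_val1, 0) + coalesce(sum_val2, 0)) %>%
  select(-n_entries, -sum_val1, -sum_val2)
This replaces the 50,000-iteration loop with one grouped summary and one join, which should take seconds rather than hours.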

Remove last N rows in data frame with the arbitrary number of rows

I have a data frame and I want to remove the last N rows from it.
If I want to remove 5 rows, I currently use the following command, which in my opinion is rather convoluted:
df <- df[-seq(nrow(df), nrow(df) - 4), ]
How would you accomplish this task? Is there a convenient function in R that I can use?
In Unix, I would use:
tac file | sed '1,5d' | tac
head with a negative index is convenient for this...
df <- data.frame( a = 1:10 )
head(df,-5)
# a
#1 1
#2 2
#3 3
#4 4
#5 5
P.S. Your seq() example may be written slightly less(?) awkwardly using the named arguments by and length.out (shortened to len), like this: -seq(nrow(df), by = -1, len = 5).
This one takes one more line, but is far more readable:
n <- dim(df)[1]
df <- df[1:(n - 5), ]
Of course, you can do it in one line by sticking the dim command directly into the re-assignment statement.
I assume this is part of a reproducible script and you can retrace your steps... Otherwise, I strongly recommend in such cases saving to a different variable (e.g., df2) and removing the redundant copy only after you're sure you got what you wanted.
Adding a dplyr answer for completeness:
test_df <- data_frame(a = c(1,2,3,4,5,6,7,8,9,10),
                      b = c("a","b","c","d","e","f","g","h","i","j"))
slice(test_df, 1:(n()-5))
## A tibble: 5 x 2
# a b
# <dbl> <chr>
#1 1 a
#2 2 b
#3 3 c
#4 4 d
#5 5 e
Another dplyr answer which is even more readable:
df %>% filter(row_number() <= n()-5)
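On newer dplyr versions (1.0.0 and later, which postdate these answers), slice_head() accepts a negative n, which reads even more directly. A sketch assuming such a version:
df %>% slice_head(n = -5)  # keep everything except the last 5 rows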
