I have a simple data frame with 3 columns: name, goal, and actual.
Because this is a simplification of a much larger data frame, I want to use dplyr to compute the number of times each person has met a goal.
df <- data.frame(name = c(rep('Fred', 3), rep('Sally', 4)),
goal = c(4,6,5,7,3,8,5), actual=c(4,5,5,3,3,6,4))
The result should look like this:
I should be able to pass an anonymous function similar to what is shown below, but don't have the syntax quite right:
library(dplyr)
g <- group_by(df, name)
summ <- summarise(g, met_goal = sum((function(x,y) {
if(x>y){return(0)}
else{return(1)}
})(goal, actual)
)
)
When I run the code above, I see 3 of these warnings:
Warning messages:
1: In if (x > y) { :
the condition has length > 1 and only the first element will be used
goal and actual are vectors of equal length, so the relational operators compare them element by element. Using such a comparison inside a plain if() statement gives unexpected results, because if() expects a length-1 condition; as the warning above says, only the first element is used. Since you need a binary result for each row, summing the logical vector is the simplest approach:
group_by(df, name) %>%
summarise(met_goal = sum(goal <= actual))
# A tibble: 2 x 2
name met_goal
<fctr> <int>
1 Fred 2
2 Sally 1
The operator is switched to <= because you want 0 for goal > actual and 1 otherwise.
Note that you can use an anonymous function. It was the if() statement that was throwing you off. For example, using
sum((function(x, y) x <= y)(goal, actual))
would work in the manner you are asking about.
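If you would rather keep the if/else shape of the original function, a vectorized ifelse() is one option. The snippet below is a minimal sketch on Fred's rows only; met_if is a name made up for illustration and is not part of the original question.

```r
# Fred's rows from the example data
goal   <- c(4, 6, 5)
actual <- c(4, 5, 5)

# ifelse() evaluates the condition element by element, unlike if()
met_if <- function(x, y) ifelse(x > y, 0, 1)   # hypothetical helper name

n_ifelse  <- sum(met_if(goal, actual))   # count via ifelse()
n_logical <- sum(goal <= actual)         # count via the logical vector directly
```

Both counts agree; the logical-vector version is simply shorter.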
Solution using data.table:
You asked for a dplyr solution, but since your actual data is much larger, you could also use data.table. foo is the function you want to apply.
foo <- function(x, y) {
res <- 0
if (x <= y) {
res <- 1
}
return(res)
}
library(data.table)
setDT(df)
setkey(df, name)[, foo(goal, actual), .(name, 1:nrow(df))][, sum(V1), name]
If you prefer pipes then you can use this:
library(magrittr)
setDT(df) %>%
setkey(name) %>%
.[, foo(goal, actual), .(name, 1:nrow(.))] %>%
.[, .(met_goal = sum(V1)), name]
name met_goal
1: Fred 2
2: Sally 1
Found myself needing to do something similar to this again (a year later) but with a more complex function than the simple one provided in the original question. The originally accepted answer took advantage of a specific feature of the problem, but the more general approach was touched on here. Using this approach, the answer I was ultimately after was something like this:
library(dplyr)
df <- data.frame(name = c(rep('Fred', 3), rep('Sally', 4)),
goal = c(4,6,5,7,3,8,5), actual=c(4,5,5,3,3,6,4))
my_func = function(act, goa) {
if(act < goa) {
return(0)
} else {
return(1)
}
}
summ = df %>% group_by(name) %>%
summarise(met_goal = sum(mapply(my_func, .data$actual, .data$goal)))
> summ
# A tibble: 2 x 2
name met_goal
<fct> <dbl>
1 Fred 2
2 Sally 1
The original question referred to using an anonymous function. In that spirit, the last part would look like this:
summ = df %>% group_by(name) %>%
summarise(met_goal = sum(mapply(function(act, go) {
if(act < go) {
return(0)
} else {
return(1)
}
}, .data$actual, .data$goal)))
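As an alternative to calling mapply() inside summarise(), base R's Vectorize() can wrap a scalar function once so that it accepts whole columns. This is only a sketch of that idea; my_func_v is a name introduced here for illustration.

```r
library(dplyr)

df <- data.frame(name = c(rep('Fred', 3), rep('Sally', 4)),
                 goal = c(4, 6, 5, 7, 3, 8, 5),
                 actual = c(4, 5, 5, 3, 3, 6, 4))

# Scalar function: works on one (actual, goal) pair at a time
my_func <- function(act, goa) if (act < goa) 0 else 1

# Vectorize() returns a wrapper that calls mapply() internally
my_func_v <- Vectorize(my_func)   # hypothetical name

summ <- df %>%
  group_by(name) %>%
  summarise(met_goal = sum(my_func_v(actual, goal)))
```

This keeps the summarise() call short when the same scalar function is reused in several places.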
Sorry for the basic question, but I could not find an example in this forum to solve this question. I've tried this one and this one.
I want to change / create a new variable in my data.frame via function in R and tidyverse:
Example:
crivo <- function(x) {
x <<- x %>%
mutate(resp_1 = if_else(MEMO_RE01 == 0,"VN","FP")) %>%
mutate(resp_2 = if_else(MEMO_RE02 == 1,"VP","FN"))
}
crivo(memo_re)
My data.frame is named "memo_re", but I'll use this function on other datasets as well, just by changing the x argument. R is creating a new data.frame named x instead of creating a new variable in "memo_re" (the original dataset). In other words, I want a function that does this:
memo_re <- memo_re %>% mutate(resp_1 = if_else(MEMO_RE01 == 0,"VN","FP"))
But I need to change many datasets and because of that, I want to be able to specify which dataset I'll change.
reproducible code
library(tidyverse)
memo_re <- data.frame(MEMO_RE01=rep(c(0,1),100), MEMO_RE02=c(0,1))
crivo <- function(x) {
x <<- x %>%
mutate(resp_1 = if_else(MEMO_RE01 == 0,"VN","FP")) %>%
mutate(resp_2 = if_else(MEMO_RE02 == 1,"VP","FN"))
}
crivo(memo_re)
R is doing exactly what you've asked it to do. In your crivo function definition, you assign the new data frame to a variable called x outside the function (here, in the global environment). That's what the <<- operator does. After running your code, use ls() to see what's in your environment, then look at x. You'll see everything is there, just as you've asked it to be, including the correctly mutated x data frame.
> memo_re <- data.frame(MEMO_RE01=rep(c(0,1),100), MEMO_RE02=c(0,1))
>
> crivo <- function(x) {
+ x <<- x %>%
+ mutate(resp_1 = if_else(MEMO_RE01 == 0,"VN","FP")) %>%
+ mutate(resp_2 = if_else(MEMO_RE02 == 1,"VP","FN"))
+ }
> crivo(memo_re)
> ls()
[1] "crivo" "memo_re" "x"
> head(x)
MEMO_RE01 MEMO_RE02 resp_1 resp_2
1 0 0 VN FN
2 1 1 FP VP
3 0 0 VN FN
4 1 1 FP VP
5 0 0 VN FN
6 1 1 FP VP
Now, if you wanted to have crivo() return something that you could then assign any name you wanted, you should use
crivo <- function(x) {
x %>%
mutate(resp_1 = if_else(MEMO_RE01 == 0,"VN","FP"),
resp_2 = if_else(MEMO_RE02 == 1,"VP","FN"))
}
Note that I haven't used the <<- operator anywhere. As a result, the crivo function returns the mutated x data frame, so you can do
new <- memo_re %>% crivo()
This way, you can pipe anything you want to crivo and assign it to any new variable. Alternatively, if you just wanted to call the function on memo_re, you can do that too:
memo_re <- memo_re %>% crivo()
Note that the "classic" way to write a function is to use return() to specify what you want the function to return. If you don't use return() (as I haven't above), R returns the value of the last expression evaluated. Here, that's just the mutated data frame.
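Since you mention having many datasets, one way to apply the value-returning crivo() to all of them is to keep them in a named list and use lapply(). This is a sketch under that assumption; the second data frame (other) is invented for illustration, and both are kept tiny.

```r
library(dplyr)

crivo <- function(x) {
  x %>%
    mutate(resp_1 = if_else(MEMO_RE01 == 0, "VN", "FP"),
           resp_2 = if_else(MEMO_RE02 == 1, "VP", "FN"))
}

# A named list of data frames that share the same column names
datasets <- list(
  memo_re = data.frame(MEMO_RE01 = c(0, 1), MEMO_RE02 = c(0, 1)),
  other   = data.frame(MEMO_RE01 = c(1, 1), MEMO_RE02 = c(1, 0))  # hypothetical
)

# Apply crivo() to each element; the list names are preserved
datasets <- lapply(datasets, crivo)
```

Each result is then available as datasets$memo_re, datasets$other, and so on, without any global assignment inside the function.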
I realize this seems close to other questions already asked, and it seems to be answered at https://cran.r-project.org/web/packages/dplyr/vignettes/programming.html
but what I show below seems to follow that advice without success.
Here is the data:
d<-tibble(y=c(1,2,3,NA))
First something that works:
my.f <- function(df,column_var){
df %>%
mutate(z = y) %>%
filter(!is.na(z))
}
my.f(d,quo(z))
Now something that I want to produce the same result, but doesn't work:
my.f <- function(df,column_var){
df %>%
mutate(!!column_var = y) %>%
filter(!is.na(!!column_var))
}
my.f(d,quo(z))
Error: unexpected '=' in:
" df %>%
mutate(!!column_var ="
For setting variable names you'll need a string on the left hand side and := instead of = in mutate.
You can use quo_name for turning z into a string for the column name.
Your function could then look like:
my.f = function(df, column_var) {
column_var = enquo(column_var)
df %>%
mutate(!!quo_name(column_var) := y) %>%
filter( !is.na(!!column_var) )
}
my.f(d, z)
# A tibble: 3 x 2
y z
<dbl> <dbl>
1 1 1
2 2 2
3 3 3
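With more recent releases of dplyr and rlang (I am assuming dplyr 1.0 or later here), the same function can probably be written with the embrace operator {{ }} and a glue-style name on the left of :=, which avoids the explicit enquo() and quo_name() calls. A sketch, with my.f2 as a made-up name:

```r
library(dplyr)

d <- tibble(y = c(1, 2, 3, NA))

my.f2 <- function(df, column_var) {
  df %>%
    # "{{ column_var }}" builds the new column name from the argument
    mutate("{{ column_var }}" := y) %>%
    # {{ column_var }} refers to that column in the filter expression
    filter(!is.na({{ column_var }}))
}

res <- my.f2(d, z)
```

As before, the column is passed unquoted (z rather than quo(z)).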
Assuming I have a dataframe, df with this info
group wk source revenue
1 1 C 100
1 1 D 200
1 1 A 300
1 1 B 400
1 2 C 500
1 2 D 600
I'm trying to programmatically filter df down to the rows for each unique combination of group, wk and source, perform some operations on them, and then combine the results back into another data frame. I want to write a function that scales to any number of segments (not just the example scenario here). All I should need to pass are the column names by which I want to segment,
eg.
seg <- c("group", "wk", "source")
One unique combination to filter rows in df would be
df %>% filter(group == 1 & wk == 1 & source == "A")
I wrote a recursive function (get_rows) to do so, but it doesn't seem to do what I want. Could anyone provide inputs on where I'm going wrong ?
library(dplyr)
filter_row <- function(df,x)
{
df %>% filter(group == x$group & wk == x$wk & source == x$source)
}
seg <- c("group", "wk", "source")
get_rows <- function(df,seg,pos = 1, l = list())
{
while(pos <= (length(seg) + 1))
{
if(pos <= length(seg))
for(j in 1:length(unique(df[,seg[pos]])))
{
k <- unique(df[,seg[pos]])
l[seg[pos]] <- k[j]
get_rows(df,seg,pos+1,l)
return()
}
if(pos > length(seg))
{
tmp <- df %>% filter_row(l)
<call some function on tmp>
return()
}
}
}
get_rows(df,seg)
EDIT: I understand there are prebuilt methods I can use to get what I need, but I'm curious about where I'm going wrong in the recursive function I wrote.
There might be a data.table/dplyr solution out there, but this one is pretty simple.
# Just paste together the values of the columns you want to aggregate over.
# This creates a character vector with one key per row
f <- function(data, v) {apply(data[, v, drop = FALSE], 1, paste, collapse = ".")}
# Aggregate, tapply, ave, and a few more functions can do the same thing
by(data = df, # Your data here
INDICES = f(df, c("group", "wk", "source")), # Your data and columns here
FUN = identity, simplify = F) # Your function here
You can also use library(dplyr) and library(data.table):
# your_function is a placeholder; inside do(), use . to refer to the current group
df %>% data.table() %>% group_by(group, wk, source) %>% do(your_function(.))
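If the goal is just "one data frame per unique combination of the columns named in seg", base R's split() may be enough and needs neither recursion nor extra packages. The sketch below rebuilds the example data and uses sum of revenue as a stand-in for "some function on tmp"; that stand-in is my assumption, not something from the question.

```r
df <- data.frame(group   = c(1, 1, 1, 1, 1, 1),
                 wk      = c(1, 1, 1, 1, 2, 2),
                 source  = c("C", "D", "A", "B", "C", "D"),
                 revenue = c(100, 200, 300, 400, 500, 600))

seg <- c("group", "wk", "source")

# df[seg] is a list of grouping columns; drop = TRUE discards
# combinations that never occur in the data
pieces <- split(df, df[seg], drop = TRUE)

# Apply any function to each piece (placeholder: total revenue)
results <- lapply(pieces, function(tmp) sum(tmp$revenue))
```

Because seg is an ordinary character vector, the same two lines work for any number of segmenting columns.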
The dplyr::summarize() function can apply arbitrary functions over the data, but it seems that function must return a scalar value. I'm curious if there is a reasonable way to handle functions that return a vector value without making multiple calls to the function.
Here's a somewhat silly minimal example. Consider a function that gives multiple values, such as:
f <- function(x,y){
coef(lm(x ~ y, data.frame(x=x,y=y)))
}
and data that looks like:
df <- data.frame(group=c('A','A','A','A','B','B','B','B','C','C','C','C'), x=rnorm(12,1,1), y=rnorm(12,1,1))
I'd like to do something like:
df %>%
group_by(group) %>%
summarise(f(x,y))
and get back a table that has 2 columns added for each of the returned values instead of the usual 1 column. Instead, this errors with: Expecting single value
Of course we can get multiple values from dplyr::summarise() by giving the function argument multiple times:
f1 <- function(x,y) coef(lm(x ~ y, data.frame(x=x,y=y)))[[1]]
f2 <- function(x,y) coef(lm(x ~ y, data.frame(x=x,y=y)))[[2]]
df %>%
group_by(group) %>%
summarise(a = f1(x,y), b = f2(x,y))
This gives the desired output:
group a b
1 A 1.7957245 -0.339992915
2 B 0.5283379 -0.004325209
3 C 1.0797647 -0.074393457
but coding in this way is ridiculously crude and ugly.
data.table handles this case more succinctly:
dt <- as.data.table(df)
dt[, f(x,y), by="group"]
but it creates output that extends the table with additional rows instead of additional columns, which is both confusing and harder to work with:
group V1
1: A 1.795724536
2: A -0.339992915
3: B 0.528337890
4: B -0.004325209
5: C 1.079764710
6: C -0.074393457
Of course there are more classic apply strategies we could use here,
sapply(levels(df$group), function(x) coef(lm(x~y, df[df$group == x, ])))
A B C
(Intercept) 1.7957245 0.528337890 1.07976471
y -0.3399929 -0.004325209 -0.07439346
but this sacrifices both the elegance and I suspect the speed of the grouping. In particular, note that we cannot use our pre-defined function f in this case, but have to hard code the grouping into the function definition.
Is there a dplyr function for handling this case? If not, is there a more elegant way to handle this process of evaluating vector-valued functions over a data.frame by group?
You could try do
library(dplyr)
df %>%
group_by(group) %>%
do(setNames(data.frame(t(f(.$x, .$y))), letters[1:2]))
# group a b
#1 A 0.8983217 -0.04108092
#2 B 0.8945354 0.44905220
#3 C 1.2244023 -1.00715248
The output based on f1 and f2 is
df %>%
group_by(group) %>%
summarise(a = f1(x,y), b = f2(x,y))
# group a b
#1 A 0.8983217 -0.04108092
#2 B 0.8945354 0.44905220
#3 C 1.2244023 -1.00715248
Update
If you are using data.table, one option to get a similar result is
library(data.table)
setnames(setDT(df)[, as.list(f(x,y)) , group], 2:3, c('a', 'b'))[]
This is why I still love plyr::ddply():
library(plyr)
f <- function(z) setNames(coef(lm(x ~ y, z)), c("a", "b"))
ddply(df, ~ group, f)
# group a b
# 1 A 0.5213133 0.04624656
# 2 B 0.3020656 0.01450137
# 3 C 0.2189537 0.22998823
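In newer dplyr releases (1.0 and later, as far as I know), summarise() accepts an unnamed data frame result and unpacks it into columns, so the vector can be wrapped once in a one-row data frame. A sketch of that idea; f_df is a helper name invented here, and the seed is arbitrary.

```r
library(dplyr)

set.seed(42)  # arbitrary seed, only to make the example reproducible
df <- data.frame(group = rep(c("A", "B", "C"), each = 4),
                 x = rnorm(12, 1, 1), y = rnorm(12, 1, 1))

f <- function(x, y) coef(lm(x ~ y, data.frame(x = x, y = y)))

# Hypothetical wrapper: call f() once and return a one-row data frame
f_df <- function(x, y) {
  co <- f(x, y)
  data.frame(a = co[[1]], b = co[[2]])
}

# The unnamed data frame is spread into columns a and b
res <- df %>%
  group_by(group) %>%
  summarise(f_df(x, y))
```

The model is fitted only once per group, which was the main objection to the f1/f2 version.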
I have grouped data that has blocks of missing values. I used dplyr to compute the sum of my target variable over each group. For groups where the sum is zero, I want to replace that group's values with the ones from the previous group. I could do this in a loop, but since my data is in a large data frame, that would be extremely inefficient.
Here's a synthetic example:
df <- tbl_df(as.data.frame(cbind(c(rep(1, 4), rep(2, 4)),
c(abs(rnorm(4)), rep(NA, 4)))))
names(df) <- c("group", "var")
df <- df %>%
group_by(group) %>%
mutate(total = sum(var, na.rm = TRUE))
Output:
Source: local data frame [8 x 3]
Groups: group
group var total
1 1 1.3697267 4.74936
2 1 1.5263502 4.74936
3 1 0.4065596 4.74936
4 1 1.4467237 4.74936
5 2 NA 0.00000
6 2 NA 0.00000
7 2 NA 0.00000
8 2 NA 0.00000
In this case, I want to replace the values of var in group 2 with the values of var in group 1, and I want to do it by detecting that total = 0 in group 2.
I've tried to come up with a custom function to feed into do() that does this, but can't figure out how to tell it to replace values in the current group with values from a different group. With the above example, I tried the following, which will always replace using the values from group 1:
CheckDay <- function(x) {
if( all(x$total == 0) ) { x$var <- df[df$group==1, 2] } ; x
}
do(df, CheckDay)
CheckDay does return a df, but do() throws an error:
Error: Results are not data frames at positions: 1, 2
Is there a way to get this to work?
There are a couple of things going on. First, you need to make sure df is a plain data.frame. Second, your function CheckDay(x) uses both its local argument x and the global variable df; it's better to keep everything inside the function local. Finally, your call to do() is missing the (.) part: it should be do(df, CheckDay(.)). Try this; it should work:
library("dplyr")
df <- tbl_df(as.data.frame(cbind(c(rep(1, 4), rep(2, 4)),
c(abs(rnorm(4)), rep(NA, 4)))))
names(df) <- c("group", "var")
df <- df %>%
group_by(group) %>%
mutate(total = sum(var, na.rm = TRUE))
df <- as.data.frame(df)
CheckDay <- function(x) {
if( all( (x[x$group == 2, ])$total == 0) ) {
x$var <- x[x$group == 1, 2]
}
x
}
result <- do(df, CheckDay(.))
print(result)
To expand on Brouwer's answer, here is what I implemented to accomplish my goal:
1. Generate df as previously.
2. Create df.shift, a copy of df with groups 1, 1, 2, etc., i.e. a df with the variables shifted down by one group. (The rows in group 1 of df.shift could also simply be blank.)
3. Get the indices where total = 0 and copy the values from df.shift into df at those indices.
This can all be done in base R. It creates one copy, but is much cheaper and faster than looping over the groups.
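The steps above can be sketched roughly as follows. This is not my exact code: it assumes every group has the same number of rows (n) and that rows are sorted by group, and it adds a third group and a seed purely for illustration.

```r
set.seed(1)  # arbitrary seed for reproducibility
n <- 4       # assumed number of rows per group

# Step 1: data with one all-NA group (group 2)
df <- data.frame(group = rep(1:3, each = n),
                 var   = c(abs(rnorm(n)), rep(NA, n), abs(rnorm(n))))
df$total <- ave(df$var, df$group, FUN = function(v) sum(v, na.rm = TRUE))

# Step 2: copy of df with var shifted down by one group;
# the first group has no predecessor, so it is left blank (NA)
df.shift <- df
df.shift$var <- c(rep(NA, n), head(df$var, -n))

# Step 3: where total is 0, take the values from the previous group
idx <- which(df$total == 0)
df$var[idx] <- df.shift$var[idx]
```

Note that two consecutive empty groups would leave the second one as NA, since its predecessor is empty at the time of the copy.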