Say we have a data frame,
library(tidyverse)
library(rlang)
df <- tibble(id = rep(c(1:2), 10),
grade = sample(c("A", "B", "C"), 20, replace = TRUE))
We would like to get the proportion of each grade, grouped by id:
df %>%
group_by(id) %>%
summarise(
n = n(),
mu_A = mean(grade == "A"),
mu_B = mean(grade == "B"),
mu_C = mean(grade == "C")
)
I am handling a case where there are multiple conditions (many grades in this case) and would like to make my code more robust. How can we simplify this using tidy evaluation in dplyr 1.0?
I am talking about the idea of generating multiple column names by passing all grades at once, without breaking the flow of piping in dplyr, something like
# how to get the mean of A, B, C all at once?
mu_{grade} := mean(grade == {grade})
I actually found the answer to my own question from a post that I wrote 2 years ago...
I am just going to post the code right below hoping to help anybody that comes across the same problem.
make_expr <- function(x) {
x %>%
map( ~ parse_expr(str_glue("mean(grade == '{.x}')")))
}
# generate multiple expressions
grades <- c("A", "B", "C")
exprs <- grades %>% make_expr() %>% set_names(paste0("mu_", grades))
# we can 'top up' something extra by adding a named element
exprs <- c(n = parse_expr("n()"), exprs)
# using the big bang operator `!!!` to force expressions in data frame
df %>% group_by(id) %>% summarise(!!!exprs)
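As a variation on the approach above, the expressions can also be built without parsing strings. This is only a sketch, assuming rlang's expr() and the !! operator are available (rlang >= 0.4); set.seed() and the expected column names in the last comment are illustrative additions, not part of the original post.

```r
library(dplyr)
library(rlang)

set.seed(1)
df <- tibble(id = rep(1:2, 10),
             grade = sample(c("A", "B", "C"), 20, replace = TRUE))

grades <- c("A", "B", "C")

# one quoted expression per grade; !!g inlines the grade string into the call
exprs <- lapply(
  setNames(grades, paste0("mu_", grades)),
  function(g) expr(mean(grade == !!g))
)

# splice the named expressions into summarise(); the names become column names
res <- df %>%
  group_by(id) %>%
  summarise(n = n(), !!!exprs)

names(res)  # id, n, mu_A, mu_B, mu_C
```

Building calls with expr() avoids quoting problems that string pasting can run into when a value itself contains a quote character.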
I have a relatively large data file that looks like (a), and need to create a structure like (b). Thus I need to calculate the sum of Amount times Coefficient for each ID and each year.
I quickly hacked something together using nested for loops, but that's of course terribly inefficient:
library(tidyverse)
data <- tibble(
id=c("A", "B", "C", "A", "A", "B", "C"),
year=c(2002,2002,2004,2002,2003,2003,2005),
amount=c(1000,1500,1000,500,1000,1000,500),
coef=rep(0.5,7)
)
years <- sort(unique(data$year))
ids <- unique(data$id)
# as.tibble() is deprecated; name the columns up front and use as_tibble()
result <- matrix(0, length(ids), length(years),
dimnames = list(NULL, years)) %>%
as_tibble()
for (i in seq_along(ids)){
for (j in seq_along(years)){
d <- filter(data, id==ids[i] & year== years[j])
if (nrow(d)!=0){
result[i,j] <- sum(d$amount*d$coef)
}
}
}
result <- add_column(result, ID=ids, .before = 1)
I was wondering how one could solve this efficiently using map(), group_by() or any other tidyverse functions.
Thanks in advance for helpful suggestions.
Here's one way that seems to work. I'm sure there are others.
library(tidyverse)
id <- c("A", "B", "C", "A", "A", "B", "C")
year <- c(2002,2002,2004,2002,2003,2003,2005)
amount <- c(1000,1500,1000,500,1000,1000,500)
coef <- rep(0.5,7)
data <- tibble(id, year, amount, coef)
table <- data %>%
group_by(., id, year) %>%
mutate(prod = amount*coef)%>%
summarize(., sumprod = sum(prod)) %>%
spread(., year, sumprod) %>%
replace(is.na(.), 0)
Thanks for the hint, this really is just one line:
result <- data %>% group_by(id, year) %>% summarise(S=sum(amount*coef)) %>% spread(year, S)
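For readers on a newer tidyr, here is a sketch of the same one-liner with pivot_wider(), which has superseded spread(). It assumes tidyr >= 1.1 (for a scalar values_fill) and dplyr >= 1.0 (for .groups); the object name result_wide is made up for this example.

```r
library(dplyr)
library(tidyr)

data <- tibble(
  id     = c("A", "B", "C", "A", "A", "B", "C"),
  year   = c(2002, 2002, 2004, 2002, 2003, 2003, 2005),
  amount = c(1000, 1500, 1000, 500, 1000, 1000, 500),
  coef   = rep(0.5, 7)
)

result_wide <- data %>%
  group_by(id, year) %>%
  summarise(S = sum(amount * coef), .groups = "drop") %>%
  # one column per year; id/year combinations with no data become 0
  pivot_wider(names_from = year, values_from = S, values_fill = 0)

result_wide
```

The values_fill argument plays the role of the replace(is.na(.), 0) step in the earlier answer.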
I am trying to create a new data frame with 2 columns: var1 and var2, each one of them is the row sum of specific columns in data frame sampData.
library(dplyr)
sampData <-
rnorm(260) %>%
matrix(ncol = 26) %>%
data.frame() %>%
setNames(LETTERS)
var1 <- c("A", "B", "C")
var2 <- c("D", "E", "F", "G")
I know that I can select columns using [] and c(), like this:
sampData[ ,c("A","B")]
but when I try to generate and use that format from my vectors like this:
d1_ <-paste(var1, collapse=",")
d2_ <-paste(var2, collapse=",")
sampData[ ,d1_]
I get this error:
Error in `[.data.frame`(sampData, , d1_) : undefined columns selected
Which I also get if I try to calculate the rowSums -- which is what I am interested in getting.
data.frame(var1 = rowSums(sampData[ , d1_])
, var2 = rowSums(sampData[ , d2_]))
I think I have managed to figure out what you are asking, but if I am wrong, let me know.
You are trying to select columns from sampData that match the values in var1 and var2, and sum across the rows, limited to the columns that matched each.
It is always better to provide reproducible data. Here is some for this case (using dplyr to build it):
sampData <-
rnorm(260) %>%
matrix(ncol = 26) %>%
data.frame() %>%
setNames(LETTERS)
var1 <- c("A", "B", "C")
var2 <- c("D", "E", "F", "G")
Then, you don't need to concatenate the column indices at all; just use the variable (or column, in your case) directly. Here, I have made the IDs letters and will match the letters. However, if your IDs are numeric, they will be matched by position (e.g., 3 will return the third column).
data.frame(
var1sums = rowSums(sampData[, var1])
, var2sums = rowSums(sampData[, var2])
)
Of note, cat returns NULL after printing to the screen. If you need to concatenate values, you will need to use paste (or similar), but that will not work for what you are trying to do here.
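To make the point about paste concrete, here is a small base R sketch of why the pasted value cannot be used as a column index. The names d1_ and var1 come from the question; nothing else is assumed.

```r
var1 <- c("A", "B", "C")

# paste() with collapse returns ONE string, not three column names
d1_ <- paste(var1, collapse = ",")

length(d1_)        # a single element
d1_ %in% LETTERS   # FALSE: "A,B,C" is not the name of any column
all(var1 %in% LETTERS)  # TRUE: the original vector holds valid column names
```

Indexing with sampData[, var1] works because each element of var1 is looked up as its own column name.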
This question got me thinking about flexibility of such solutions, so here is an attempt using dplyr and tidyr, which yields effectively the same result. The difference is that this may provide more flexibility for variable selection or even downstream processing.
sampData %>%
# add column for individual
mutate(ind = 1:nrow(.)) %>%
# convert data to long format
gather("Variable", "Value", -ind) %>%
# Set to group by the individual we added above
group_by(ind) %>%
# Calculate sums as desired
summarise(
var1sums = sum(Value[Variable %in% var1])
, var2sums = sum(Value[Variable %in% var2])
)
However, the real advantage would come if you had an arbitrary number (or just a large number generally) of sets of variables that you wanted to get the individual sums from. Instead of manually constructing every column you might be interested in, you can use standard evaluation (as opposed to non-standard) to automatically generate the columns based on a named list of vectors:
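# assumed definition: a named list of column sets (the names become the output columns)
varList <- list(var1sums = var1, var2sums = var2)
sampData %>%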
mutate(ind = 1:nrow(.)) %>%
gather("Variable", "Value", -ind) %>%
group_by(ind) %>%
# Calculate one column for each vector in `varList`
summarise_(
.dots = lapply(varList, function(x){
paste0("sum(Value[Variable %in% c('"
, paste(x, collapse = "', '")
, "')])")
})
)
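Since summarise_() and .dots are deprecated in current dplyr, here is a base R sketch of the same idea that does not depend on them. The definition of varList is an assumption (a named list whose names become the output columns), and set.seed() is added only for reproducibility.

```r
set.seed(123)
sampData <- setNames(data.frame(matrix(rnorm(260), ncol = 26)), LETTERS)

# assumed structure: one character vector of column names per output column
varList <- list(
  var1sums = c("A", "B", "C"),
  var2sums = c("D", "E", "F", "G")
)

# one rowSums() call per set; drop = FALSE keeps single-column sets working
sums <- as.data.frame(
  lapply(varList, function(cols) rowSums(sampData[, cols, drop = FALSE]))
)

head(sums)
```

Adding another set of variables then only means adding another element to varList.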
Sample df:
df <- data.frame(x = c(runif(10,0,2*pi),runif(10,0,360)), group = gl(n = 2, k = 10, labels =c("A","B")))
I want to modify x only for group A (convert it to degrees). With base I just do:
df <- within(df,x[group == "A"] <- x[group == "A"]*180/pi)
I was wondering if there could be a way to do this with dplyr. This is wrong:
df <- df %>% filter(group == "A") %>% mutate(x = x*180/pi)
Because it returns only the subset of df where group == "A". Is there a (simple) way to do this, or is this a case where base trumps dplyr for ease of use?
We can use ifelse to create the logical condition, and based on that we either do the arithmetic calculation or else return the original values.
df %>%
mutate(x = ifelse(group=="A", x*180/pi, x))
Or, as @AlexIoannides mentioned, if_else from dplyr can be used instead. It is stricter about types: both branches must have the same type, so the result type is predictable.
In data.table, this can be done by assignment in place and should be more efficient.
library(data.table)
setDT(df)[group=="A", x := x*180/pi]
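For completeness, a minimal sketch of the if_else variant mentioned above. The sample data follows the question; set.seed() and the name out are additions for this example.

```r
library(dplyr)

set.seed(42)
df <- data.frame(
  x = c(runif(10, 0, 2 * pi), runif(10, 0, 360)),
  group = gl(n = 2, k = 10, labels = c("A", "B"))
)

# convert radians to degrees for group A only; group B rows pass through unchanged
out <- df %>%
  mutate(x = if_else(group == "A", x * 180 / pi, x))

nrow(out)  # all 20 rows are kept, unlike the filter() attempt
```

Both branches here are doubles, so if_else's type check passes without any casting.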
I would like to be able to use more automation when creating SpatialLines objects from otherwise tidy data frames.
library(sp)
#create sample data
sample_data <- data.frame(group_id = rep(c("a", "b","c"), 10),
x = rnorm(10),
y = rnorm(10))
#How can I recreate this using dplyr?
a_list <- Lines(list(Line(sample_data %>% filter(group_id == "a") %>% select(x, y))), ID = 1)
b_list <- Lines(list(Line(sample_data %>% filter(group_id == "b") %>% select(x, y))), ID = 2)
c_list <- Lines(list(Line(sample_data %>% filter(group_id == "c") %>% select(x, y))), ID = 3)
SpatialLines(list(a_list, b_list, c_list))
You can see how using something like group_by would make the process pretty easy if you could understand how the data could be piped into a list.
Using your sample data, a wrapper function, and dplyr::do will give you what you want :)
wrapper <- function(df) {
df %>% select(x,y) %>% as.data.frame %>% Line %>% list %>% return
}
y <- sample_data %>% group_by(group_id) %>%
do(res = wrapper(.))
# and now assign IDs (since we can't do that inside dplyr easily)
ids = 1:dim(y)[1]
SpatialLines(
mapply(x = y$res, ids = ids, FUN = function(x,ids) {Lines(x,ID=ids)})
)
I don't use sp so there might be a better way to assign IDs.
For reference, consider reading Hadley's comments on returning non-dataframe from dplyr do calls
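An alternative that avoids do() altogether is to split the data by group in base R and build one Lines object per piece. This is only a sketch: it assumes the sp package is installed, uses the group labels themselves as IDs rather than 1, 2, 3, and the names pieces, lines_list and sl are made up for the example.

```r
library(sp)

set.seed(1)
sample_data <- data.frame(group_id = rep(c("a", "b", "c"), 10),
                          x = rnorm(10),
                          y = rnorm(10))

# one data frame of coordinates per group, in sorted group order
pieces <- split(sample_data[, c("x", "y")], sample_data$group_id)

# wrap each coordinate matrix in a Line, then in a Lines object with its ID
lines_list <- Map(
  function(d, id) Lines(list(Line(as.matrix(d))), ID = id),
  pieces, names(pieces)
)

sl <- SpatialLines(unname(lines_list))
```

Because the ID is assigned in the same step that creates each Lines object, there is no separate ID bookkeeping afterwards.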
I have seen a couple of posts of how to write one's own function with dplyr functions. For example, you can see how you can use group_by (regroup) and summarise in this post. I thought that it would be interesting to see if I can write a function using major dplyr functions. My hope is that we can further understand how to write functions using dplyr functions.
DATA
country <- rep(c("UK", "France"), each = 5)
id <- rep(letters[1:5], times = 2)
value <- runif(10, 50, 100)
foo <- data.frame(country, id, value, stringsAsFactors = FALSE)
GOAL
I wanted to write the following process in a function.
foo %>%
mutate(new = ifelse(value > 60, 1, 0)) %>%
filter(id %in% c("a", "b", "d")) %>%
group_by(country) %>%
summarize(whatever = sum(value))
TRY
### Here is a function which does the same process
myFun <- function(x, ana, bob, cathy) x %>%
mutate(new = ifelse(ana > 60, 1, 0)) %>%
filter(bob %in% c("a", "b", "d")) %>%
regroup(as.list(cathy)) %>%
summarize(whatever = sum(ana))
myFun(foo, value, id, "country")
Source: local data frame [2 x 2]
country whatever
1 France 233.1384
2 UK 245.5400
You may notice that arrange() is not there. This is the one I am struggling with. Here are two observations. The first experiment was successful: the order of the countries changed from UK-France to France-UK. But the second experiment was not.
### Experiment 1: This works for arrange()
myFun <- function(x, ana) x %>%
arrange(ana)
myFun(foo, country)
country id value
1 France a 90.12723
2 France b 86.64229
3 France c 74.93320
4 France d 80.69495
5 France e 72.60077
6 UK a 84.28033
7 UK b 67.01209
8 UK c 94.24756
9 UK d 79.49848
10 UK e 63.51265
### Experiment2: This was not successful.
myFun <- function(x, ana, bob) x %>%
filter(ana %in% c("a", "b", "d")) %>%
arrange(bob)
myFun(foo, id, country)
Error: incorrect size (10), expecting :6
### This works, by the way.
foo %>%
filter(id %in% c("a", "b", "d")) %>%
arrange(country)
Given that the first experiment was successful, I have a hard time understanding why the second experiment failed. There may be something one has to do in the second experiment. Does anybody have an idea? Thank you for taking your time.
I installed dplyr 0.3 and lazyeval once issue 352 was closed to see how it might work to use dplyr functions in another function. After reading the vignette on non-standard evaluation, it looks like interp from lazyeval combined with the new functions ending in _ is one option. Notice group_by_ now replaces regroup.
set.seed(16)
foo = data.frame(country = rep(c("UK", "France"), each = 5),
id = rep(letters[1:5], times = 2),
value = runif(10, 50, 100), stringsAsFactors = FALSE)
First the code/results outside the function:
library(lazyeval)
library(dplyr)
foo %>%
mutate(new = ifelse(value > 60, 1, 0)) %>%
filter(id %in% c("a", "b", "d")) %>%
group_by(country) %>%
summarize(whatever = sum(value))
Source: local data frame [2 x 2]
country whatever
1 France 213.0009
2 UK 207.8331
Then turn the above process into a function:
myFun = function(x, ana, bob, cathy) {
x %>%
mutate_(new = interp(~ifelse(var > 60 , 1, 0), var = as.name(ana))) %>%
filter_(interp(~var %in% c("a", "b", "d"), var = as.name(bob))) %>%
group_by_(cathy) %>%
summarize_(whatever = interp(~sum(var), var = as.name(ana)))
}
Which gives the desired results.
myFun(foo, "value", "id", "country")
Source: local data frame [2 x 2]
country whatever
1 France 213.0009
2 UK 207.8331
For your second problem with arrange, I tried:
myfun2 = function(x, ana, bob) x%>%
filter_(interp(~var %in% c("a", "b", "d"), var = as.name(ana))) %>%
arrange_(as.name(bob))
myfun2(foo, "id", "country")
Actually, your experiments do not work; you will have scoping problems with all of them. They look like they are working because you defined the vectors country, id, and value in the global environment and did not remove them. So when you call your functions, they use the vectors from the global environment rather than the columns of the data frame.
To show this, let's remove those vectors before calling your functions:
Creating the vectors and data.frame:
library(dplyr)
country <- rep(c("UK", "France"), each = 5)
id <- rep(letters[1:5], times = 2)
value <- runif(10, 50, 100)
foo <- data.frame(country, id, value, stringsAsFactors = FALSE)
Defining your first function:
myFun <- function(x, ana, bob, cathy) x %>%
mutate(new = ifelse(ana > 60, 1, 0)) %>%
filter(bob %in% c("a", "b", "d")) %>%
regroup(as.list(cathy)) %>%
summarize(whatever = sum(ana))
Calling without removing the vectors (it will look like it works, but it is actually using the vectors from the global env):
myFun(foo, value, id, "country")
Source: local data frame [2 x 2]
country whatever
1 France 208.1008
2 UK 192.4287
Now removing the vectors and calling your function (and now it does not work, for it can't find the vectors):
rm(country, id, value)
myFun(foo, value, id, "country")
Error in mutate_impl(.data, named_dots(...), environment()) :
object 'value' not found
So that explains why your arrange example did not work while the others did. The vector your second experiment was using was the vector country in the global environment, which has 10 elements. But arrange was expecting only 6 elements, the number of rows left after filtering.
You have different strategies to make your functions work. For example, take a look at this answer by G. Grothendieck to have some insights on how to do it. Or just wait a little, for as Hadley pointed out, programming in dplyr is a future feature coming soon.
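For anyone reading this later: the underscore verbs and lazyeval::interp were since deprecated. Below is a sketch of how the same function might be written with the embrace operator, assuming dplyr >= 1.0 and rlang >= 0.4; the names my_fun and direct are made up for the example.

```r
library(dplyr)

set.seed(16)
foo <- data.frame(country = rep(c("UK", "France"), each = 5),
                  id = rep(letters[1:5], times = 2),
                  value = runif(10, 50, 100), stringsAsFactors = FALSE)

# {{ }} forwards bare column names, so they are looked up in the data frame
# and not in the calling or global environment
my_fun <- function(data, ana, bob, cathy) {
  data %>%
    mutate(new = if_else({{ ana }} > 60, 1, 0)) %>%
    filter({{ bob }} %in% c("a", "b", "d")) %>%
    group_by({{ cathy }}) %>%
    summarise(whatever = sum({{ ana }}), .groups = "drop")
}

# columns are passed unquoted
res <- my_fun(foo, value, id, country)

# the same pipeline written out directly, for comparison
direct <- foo %>%
  filter(id %in% c("a", "b", "d")) %>%
  group_by(country) %>%
  summarise(whatever = sum(value), .groups = "drop")
```

Because the embraced arguments refer to columns of data, the function keeps working even if no vectors named value, id or country exist in the global environment.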