I'm writing functions that take a data.frame and perform some operations on it. Along the way I need to add and remove variables from the group_by criteria.
Adding a grouping variable to a df is easy:
library(tidyverse)
set.seed(42)
n <- 10
input <- data.frame(a = 'a',
b = 'b' ,
vals = 1
)
input %>%
group_by(a) ->
grouped
grouped
#> # A tibble: 1 x 3
#> # Groups: a [1]
#> a b vals
#> <fct> <fct> <dbl>
#> 1 a b 1.
## add a group:
grouped %>%
group_by(b, add=TRUE)
#> # A tibble: 1 x 3
#> # Groups: a, b [1]
#> a b vals
#> <fct> <fct> <dbl>
#> 1 a b 1.
## drop a group?
But how do I programmatically drop the grouping by b which I added, yet keep all other groupings the same?
Here's an approach that uses tidyeval so that bare column names can be used as the function arguments. I'm not sure if it makes sense to convert the bare column names to text (as I've done below) or if there's a more elegant way to work directly with the bare column names.
drop_groups = function(data, ...) {
groups = map_chr(groups(data), rlang::quo_text)
drop = map_chr(quos(...), rlang::quo_text)
if(any(!drop %in% groups)) {
warning(paste("Input data frame is not grouped by the following groups:",
paste(drop[!drop %in% groups], collapse=", ")))
}
data %>% group_by_at(setdiff(groups, drop))
}
d = mtcars %>% group_by(cyl, vs, am)
groups(d %>% drop_groups(vs, cyl))
[[1]]
am
groups(d %>% drop_groups(a, vs, b, c))
[[1]]
cyl
[[2]]
am
Warning message:
In drop_groups(., a, vs, b, c) :
Input data frame is not grouped by the following groups: a, b, c
UPDATE: The approach below works directly with quosured column names, without converting them to strings. I'm not sure which approach is "preferred" in the tidyeval paradigm, or whether there is yet another, more desirable method.
drop_groups2 = function(data, ...) {
groups = map(groups(data), quo)
drop = quos(...)
if(any(!drop %in% groups)) {
warning(paste("Input data frame is not grouped by the following groups:",
paste(drop[!drop %in% groups], collapse=", ")))
}
data %>% group_by(!!!setdiff(groups, drop))
}
Maybe something like this, which removes grouping variables starting from the end of the list:
grouped %>%
group_by(b, add=TRUE) -> grouped
grouped %>% group_by_at(.vars = group_vars(.)[-2])
or use head or tail or something on the output from group_vars for more control.
It would be interesting to have this sort of utility function available more generally:
peel_groups <- function(.data,n){
.data %>%
group_by_at(.vars = head(group_vars(.data),-n))
}
A more thought out version would likely include more careful checks on n being out of bounds.
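A sketch of what such a more careful version might look like, assuming dplyr >= 1.0.0 (for across() and all_of()). The name peel_groups_checked and the choice to clamp an oversized n with a warning are illustrative assumptions, not part of dplyr:

```r
library(dplyr)

# Hypothetical helper: drop the last n grouping variables, with bounds checks.
peel_groups_checked <- function(.data, n = 1) {
  vars <- group_vars(.data)
  stopifnot(is.numeric(n), length(n) == 1, n >= 0)
  if (n > length(vars)) {
    warning("n exceeds the number of grouping variables; dropping all groups")
    n <- length(vars)
  }
  # Compute the number of groups to keep explicitly: head(vars, -n) would
  # return character(0) for n = 0 instead of keeping every group.
  keep <- head(vars, length(vars) - n)
  if (length(keep) == 0) {
    return(ungroup(.data))
  }
  .data %>%
    ungroup() %>%
    group_by(across(all_of(keep)))
}

d <- mtcars %>% group_by(cyl, vs, am)
group_vars(peel_groups_checked(d, 1))  # last group (am) peeled off
group_vars(peel_groups_checked(d, 0))  # nothing peeled off
```

Ungrouping first and regrouping on the kept names avoids relying on how group_by() treats data that is already grouped.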
Function to remove groups by column name
drop_groups_at <- function(df, vars){
df %>%
group_by_at(setdiff(group_vars(.), vars))
}
input %>%
group_by(a, b) %>%
drop_groups_at('b') %>%
group_vars
# [1] "a"
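In dplyr 1.0.0 and later, group_by_at() is superseded and the add argument of group_by() is spelled .add. A minimal sketch of the same idea under that version assumption, using across(all_of()); the name drop_groups_across is hypothetical:

```r
library(dplyr)

# Hypothetical helper: remove the named variables from the current grouping.
drop_groups_across <- function(df, vars) {
  keep <- setdiff(group_vars(df), vars)
  if (length(keep) == 0) {
    return(ungroup(df))  # nothing left to group by
  }
  df %>%
    ungroup() %>%
    group_by(across(all_of(keep)))
}

input <- data.frame(a = "a", b = "b", vals = 1)

regrouped <- input %>%
  group_by(a) %>%
  group_by(b, .add = TRUE) %>%   # .add replaces add in dplyr >= 1.0.0
  drop_groups_across("b")

group_vars(regrouped)
```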
Related
I am starting to learn how to use dplyr's pipe (%>%) for manipulating data frames. I like that it seems much more streamlined. However, I just ran into a problem that I could not solve with pipes alone.
I have a data frame that holds relationship (network) data. The first two columns indicate which items (genes) are related, and the third column contains information about that relationship:
a b c
1 Gene_1 Gene_2 X
2 Gene_2 Gene_3 R
3 Gene_1 Gene_4 X
My goal is to get a list of unique genes that share the same attribute. If the attribute X in col 3 is selected, I would get this data frame:
a b c
1 Gene_1 Gene_2 X
3 Gene_1 Gene_4 X
And I would want to end with this list of unique genes:
genes = c("Gene_1", "Gene_2", "Gene_4")
It does not matter if the item (Gene) comes from the first column or the second, I just want a unique list. I came up with this solution:
library(dplyr) # filter() and select() come from dplyr, which also re-exports tibble()
net = tibble(a = c("Gene_1", "Gene_2", "Gene_1"),
b = c("Gene_2", "Gene_3", "Gene_4"),
c = c("X", "R", "X"))
df = net %>%
filter(c == "X") %>%
select(c(1,2))
genes = unique(c(df$a, df$b))
but I am not satisfied, as I was not able to do everything within the dplyr pipe. I had to build the vector outside of the pipe and then call unique on it.
Is there a way to accomplish this task by piping into another call? I could not find any way to do this. Thanks.
1) Use {...} like this:
net %>%
filter(c == "X") %>%
select(c(1,2)) %>%
{ unique(c(.$a, .$b)) }
## [1] "Gene_1" "Gene_2" "Gene_4"
2) or use magrittr's %$% pipe:
library(magrittr)
net %>%
filter(c == "X") %>%
select(c(1,2)) %$%
unique(c(a, b))
## [1] "Gene_1" "Gene_2" "Gene_4"
3) or use with:
net %>%
filter(c == "X") %>%
select(c(1,2)) %>%
with(unique(c(a, b)))
## [1] "Gene_1" "Gene_2" "Gene_4"
Since the result is not a data frame, it is best not to call it df.
The unlist() function is probably what you are looking for.
Quoting from the built in documentation for ?unlist: "Given a list structure x, unlist simplifies it to produce a vector which contains all the atomic components which occur in x."
Since R data frames (and tibbles) are implemented as lists of column vectors with equal lengths, the unlist function will effectively convert a data frame into a vector.
Subset for the desired rows and columns with filter and select, then pipe the result through unlist() and then unique(). The result will be a vector with the distinct elements.
library(dplyr)
# The example data
tibble(a = c("Gene_1", "Gene_2", "Gene_1"),
b = c("Gene_2", "Gene_3", "Gene_4"),
c = c("X", "R", "X")) %>%
# Subset data for desired feature
filter(c == "X") %>%
# Select identifier columns
select(a, b) %>%
# convert to a vector
unlist() %>%
# derive unique elements
unique()
Result
[1] "Gene_1" "Gene_2" "Gene_4"
I would suggest using tidyr::pivot_longer to reshape the multiple columns of potential matches from the two distinct gene columns, to a value column (which we care about) and a name column (referencing the original column name, which we don't care about and can ignore). Then distinct to get unique matches, and finally the match to column c:
net %>%
pivot_longer(-c) %>%
distinct(c, value) %>%
filter(c == "X")
If you want the result as a vector, you could add %>% pull(value).
One benefit of this approach is that we already have every distinct set of genes for every column c value calculated, and the last filter step just narrows it to one example c value.
Result
c value
<chr> <chr>
1 X Gene_1
2 X Gene_2
3 X Gene_4
[Note: I made a = c("Gene_1", "Gene_2", "Gene_1") and b = c("Gene_2", "Gene_3", "Gene_4") to match example.]
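To make the "every value of c at once" benefit concrete, here is a small sketch that stops before the filter step and splits the distinct genes by attribute with base R's split(). It assumes the example data from the question; the name genes_by_attr is made up for illustration:

```r
library(dplyr)
library(tidyr)

net <- tibble(a = c("Gene_1", "Gene_2", "Gene_1"),
              b = c("Gene_2", "Gene_3", "Gene_4"),
              c = c("X", "R", "X"))

# One character vector of distinct genes per value of c
genes_by_attr <- net %>%
  pivot_longer(-c) %>%        # columns a and b become name/value pairs
  distinct(c, value) %>%      # unique (attribute, gene) combinations
  with(split(value, c))       # named list keyed by attribute

genes_by_attr$X
```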
I realize this question has several answers, but I would have gone a slightly different way with it. Perhaps it will be useful to someone?
I created a data set to demonstrate, as well.
library(tidyverse)
library(stringi) # only used in data generation
# data set creation 100 rows
a = paste0("Gene_",1:100)
b = paste0("Gene_",round(runif(100, 10, 99),digits = 0))
cC = paste0(stringi::stri_rand_strings(100, 1, '[A-Z]'))
# put it together and strip the information
data.frame(a = a, b = b, cC = cC) %>% # collect the data
filter(cC == "X") %>% # filter for attribute
select(-cC) %>% # remove attribute field
unlist() %>% # collapse the data frame into a vector
unique() # show me what's unique
# output example
# [1] "Gene_10" "Gene_12" "Gene_28" "Gene_77" "Gene_22" "Gene_41" "Gene_75"
# [8] "Gene_19"
library(tidyverse)
net <- tibble(
a = c("Gene_1", "Gene_1", "Gene_3"),
b = c("Gene_2", "Gene_4", "Gene_5"),
c = c("X", "R", "X")
)
df <- net %>%
filter(c == "X") %>%
select(a, b)
df
#> # A tibble: 2 x 2
#> a b
#> <chr> <chr>
#> 1 Gene_1 Gene_2
#> 2 Gene_3 Gene_5
genes <- net %>%
select(-c) %>%
unlist() %>%
unique()
genes
#> [1] "Gene_1" "Gene_3" "Gene_2" "Gene_4" "Gene_5"
Many enlightening answers have already been proposed (and one accepted by the OP), but I just want to add that if you want the result for all values of c simultaneously, you can do this:
library(tidyverse)
net %>%
group_split(c, .keep = FALSE) %>%
# group_split() returns the groups sorted by c, so the names must be sorted too
setNames(sort(unique(net$c))) %>%
map(~ (.x %>% unlist() %>% unique()))
$R
[1] "Gene_2" "Gene_3"
$X
[1] "Gene_1" "Gene_2" "Gene_4"
On a fairly regular basis I want to pass in strings that function as arguments in code. For context, I often want a section where I can pass in filtering criteria or assumptions that then flow through my analysis, plots, etc. to make it more interactive.
A simple example is below. I've seen the eval/parse solution, but it seems like that makes code chunks unreadable. Is there a better/cleaner/shorter way to do this?
column.names <- c("group1", "group2") #two column names I want to be able to toggle between for grouping
select.column <- column.names[1] #Select the column for grouping
DataTable.summary <-
DataTable %>%
group_by(select.column) %>% #How do I pass that selection in here?
summarize(avg.price = mean(SALES.PRICE))
Well, this is just a copy-paste from the tidyverse website (https://dplyr.tidyverse.org/articles/programming.html#programming-recipes):
my_summarise <- function(df, group_var) {
group_var <- enquo(group_var)
print(group_var)
df %>%
group_by(!! group_var) %>%
summarise(a = mean(a))
}
my_summarise(df, g1)
#> <quosure>
#> expr: ^g1
#> env: global
#> # A tibble: 2 x 2
#> g1 a
#> <dbl> <dbl>
#> 1 1 2.5
#> 2 2 3.33
But I think it illustrates your problem. I think what you really want to do is something like the code above, i.e. create a function.
You can use the group_by_ function for the example in your question:
library(dplyr)
x <- data.frame(group1 = letters[1:4], group2 = LETTERS[1:4], value = 1:4)
select.colums <- c("group1", "group2")
x %>% group_by_(select.colums[2]) %>% summarize(avg = mean(value))
# A tibble: 4 x 2
# group2 avg
# <fct> <dbl>
# 1 A 1
# 2 B 2
# 3 C 3
# 4 D 4
The *_ family of functions in dplyr might also offer the more general solution you are after, although the dplyr documentation (?group_by_) says they are deprecated and might disappear at some point. An analogous expression using tidy evaluation syntax seems to be:
x %>% group_by(!!sym(select.colums[2])) %>% summarize(avg = mean(value))
And for several columns:
x %>% group_by(!!!syms(select.colums)) %>% summarize(avg = mean(value))
sym() creates a symbol from a string, and !! unquotes it so that dplyr evaluates it as a column name; syms() and !!! do the same for a character vector.
I recommend using group_by_at(). It supports both single strings and character vectors:
nms <- c("cyl", "am")
mtcars %>% group_by_at(nms)
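For dplyr 1.0.0 and later, where group_by_at() is superseded, a sketch of the question's toggle using across(all_of()). The sales data below is invented purely for illustration; only the column names group1, group2 and SALES.PRICE follow the question:

```r
library(dplyr)

# Made-up data standing in for the question's DataTable
sales <- tibble(
  group1 = c("a", "a", "b"),
  group2 = c("x", "y", "y"),
  SALES.PRICE = c(10, 20, 60)
)

column.names <- c("group1", "group2")  # columns to toggle between
select.column <- column.names[1]       # the column chosen for grouping

summary_tbl <- sales %>%
  group_by(across(all_of(select.column))) %>%  # string -> grouping column
  summarise(avg.price = mean(SALES.PRICE), .groups = "drop")

summary_tbl
```

Changing the index in column.names[1] is then the only edit needed to regroup the whole analysis.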
Code
Suppose I have the following code. (I know that in this case I could replace the second do with a simple mutate and skip the rowwise(), but that is not the point: in my real code the second do is a bit more complicated and calculates a model.)
library(dplyr)
set.seed(1)
d <- data_frame(n = c(5, 1, 3))
e <- d %>% group_by(n) %>%
do(data_frame(y = rnorm(.$n), dat = list(data.frame(a = 1))))
e %>% rowwise() %>% do(data_frame(sum = .$y + .$n))
# Source: local data frame [9 x 1]
# Groups: <by row>
# # A tibble: 9 x 1
# sum
# * <dbl>
# 1 0.3735462
# 2 3.1836433
# 3 2.1643714
# 4 4.5952808
# 5 5.3295078
# 6 4.1795316
# 7 5.4874291
# 8 5.7383247
# 9 5.5757814
Problem
As you can see, the result contains only the column sum.
Question
Is there a way to keep the original columns from e without needing to specify them explicitly (as in e %>% do(data_frame(n = .$n, y = .$y, dat = .$dat, sum = .$y + .$n))) in dplyr, or do I have to use purrrlyr::by_row? (Not that I dislike purrrlyr*; I was just wondering whether there is a straightforward dplyr way of doing it that I may have overlooked.) The purrrlyr version:
e %>% purrrlyr::by_row(function(x) x$y + x$n, .collate = "cols", .to = "sum")
*) Well, there is in fact a catch with purrrlyr::by_row:
e %>% purrrlyr::by_row(function(x) data_frame(sum = x$y + x$n, diff = x$y - x$n),
.collate ="cols")
This produces columns sum1 and diff1, which I would need to rename to get sum and diff, adding another line of code.
I almost never use do, but rather do a combination of nest, mutate and map.
It's a bit hard to tell how that would look in your case, as your example doesn't seem to fully specify your needs.
In the simplest case, you could name just the variables that you need (this also works if they are list columns, for example lists of S3 objects):
mutate(e, sum = map2_dbl(y, n, `+`))
Or, you could nest the required data then map the whole data. E.g.:
f <- e
f$r <- 1:nrow(e) # i.e. add some other variable, not necessarily row indices
f %>%
ungroup() %>% # e was still grouped
nest(n:dat) %>% # specify which variables you need
mutate(sum = map_dbl(data, ~.$y + .$n)) %>% # map to data, use the same formula as in do
unnest() # unnest to get original columns back
Both leave the original columns untouched.
For a modeling example, e.g.:
mtcars %>%
group_by(cyl) %>%
nest() %>%
mutate(model = map(data, ~lm(qsec ~ hp, .)),
coef = map_dbl(model, ~coef(.)[2])) %>%
unnest(data)
This will give you all your original data, with the regression coefficient for each group added. Before unnesting, the complete models are stored in the data frame as a list column.
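With dplyr 1.0.0 or later, another option (a swapped-in technique, not the nest/map approach above) is rowwise() followed by mutate(), which adds columns while leaving the existing ones in place. A minimal sketch on a small hand-made stand-in for e, so the values are not the ones from the question:

```r
library(dplyr)

# Hand-made stand-in for e: two numeric columns and a list column
e <- tibble(
  n = c(1, 3, 3),
  y = c(0.5, -1, 2),
  dat = list(data.frame(a = 1), data.frame(a = 2), data.frame(a = 3))
)

res <- e %>%
  rowwise() %>%
  mutate(sum = y + n,              # plain per-row arithmetic
         dat_rows = nrow(dat)) %>% # list-column elements are unwrapped per row
  ungroup()

names(res)
```

Inside rowwise(), each list-column entry is seen as the element itself, so a model fit on dat could be wrapped in list() and stored the same way.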
I am struggling a little with dplyr because I want to do two things at once and wonder if it is possible.
I want to calculate the mean of values and, at the same time, the mean of only those values that have a specific value in another column.
library(dplyr)
set.seed(1234)
df <- data.frame(id=rep(1:10, each=14),
tp=letters[1:14],
value_type=sample(LETTERS[1:3], 140, replace=TRUE),
values=runif(140))
df %>%
group_by(id, tp) %>%
summarise(
all_mean=mean(values),
A_mean=mean(values), # Only the values with value_type A
value_count=sum(value_type == 'A')
)
So the A_mean column should contain the mean of values where value_type == 'A'.
I would normally do two separate commands and merge the results later, but I guess there is a more handy way and I just don't get it.
Thanks in advance.
We can try
df %>%
group_by(id, tp) %>%
summarise(all_mean = mean(values),
A_mean = mean(values[value_type=="A"]),
value_count=sum(value_type == 'A'))
You can do this with two summary steps:
df %>%
group_by(id, tp, value_type) %>%
summarise(A_mean = mean(values)) %>%
summarise(all_mean = mean(A_mean),
A_mean = sum(A_mean * (value_type == "A")),
value_count = sum(value_type == "A"))
The first summary calculates the means per value_type and the second "sums" only the mean of value_type == "A"
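One thing worth checking before relying on the two-step version: it averages the per-type means and counts types rather than rows, so it agrees with the one-step subsetting version only when every group has the same number of rows per value_type. A small comparison on made-up data (the toy tibble is an assumption for illustration):

```r
library(dplyr)

# One group with two "A" rows and one "B" row
toy <- tibble(id = 1,
              value_type = c("A", "A", "B"),
              values = c(1, 3, 10))

# One step: subset inside summarise()
one_step <- toy %>%
  group_by(id) %>%
  summarise(all_mean = mean(values),
            A_mean = mean(values[value_type == "A"]),
            value_count = sum(value_type == "A"))

# Two steps: summarise per value_type, then per id
two_step <- toy %>%
  group_by(id, value_type) %>%
  summarise(A_mean = mean(values), .groups = "drop_last") %>%
  summarise(all_mean = mean(A_mean),
            A_mean = sum(A_mean * (value_type == "A")),
            value_count = sum(value_type == "A"))

one_step
two_step
```

Here both versions give the same A_mean, but all_mean and value_count differ, because the second step only sees one row per value_type.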
You can also give the following function a try:
?summarise_if
(the function family is summarise_all)
Example
The dplyr documentation provides quite a good example of this, I think:
# The _if() variants apply a predicate function (a function that
# returns TRUE or FALSE) to determine the relevant subset of
# columns. Here we apply mean() to the numeric columns:
starwars %>%
summarise_if(is.numeric, mean, na.rm = TRUE)
#> # A tibble: 1 x 3
#> height mass birth_year
#> <dbl> <dbl> <dbl>
#> 1 174. 97.3 87.6
The interesting part here is the predicate function: it is the rule that selects which columns get summarised.
I'm trying to create new dataframes from dplyr 0.4.3 functions using R 3.2.2.
What I want to do is create some new dataframes using dplyr::filter to separate out data from one ginormous dataframe into a bunch of smaller dataframes.
For my reproducible base case bog simple example, I used this:
filter(mtcars, cyl == 4)
I know I need to assign that to a dataframe of its own, so I started with:
paste("Cylinders:", x, sep = "") <- filter(mtcars, cyl == 4))
That didn't work -- it gave me the error found here: Assignment Expands to Non-Language Object
From there, I found this: Create A Variable Name with Paste in R
(also, big ups to the authors of the above)
And that led me to this, which works:
assign(paste("gears_cars_cylinders", 4, sep = "_"), filter(mtcars, cyl == 4)) %>%
group_by(gear) %>%
summarise(number_of_cars = n())
and by "works," I mean I get a dataframe named gears_cars_cylinders_4 with all the goodies from
filter(mtcars, cyl == 4) %>%
group_by(gear) %>%
summarise(number_of_cars = n())
But ultimately, I think I need to wrap this whole thing in a function and be able to feed it the cylinder numbers from mtcars$cyl. I'm thinking something like plyr::ldply(mtcars$cyl, function_name)?
In my real-life data, I have about 70 different classes I need to split out into separate dataframes to drop into DT::datatable tabs in Shiny, which is a whole nuther mess. Anyway.
When I try this:
function_name <- function(x){
assign(paste("gears_cars_cylinders", x, sep = "_"), filter(mtcars, cyl == x)) %>%
group_by(gear) %>%
summarise(number_of_cars = n())
}
and then function_name(6),
I get the output of the dataframe to the screen, but not a dataframe with the name.
Am I looking right over the answer here?
You need to assign the new data frames into the environment from which you're calling function_name(). Try something like this:
library(dplyr)
foo <- function(x) {
assign(paste("gears_cars_cylinders", x, sep = "_"),
envir = parent.frame(),
value = mtcars %>%
filter(cyl == x) %>%
count(gear))
}
for(cyl in sort(unique(mtcars$cyl))) foo(cyl)
ls()
#> [1] "cyl" "foo"
#> [3] "gears_cars_cylinders_4" "gears_cars_cylinders_6"
#> [5] "gears_cars_cylinders_8"
gears_cars_cylinders_4
#> Source: local data frame [3 x 2]
#>
#> gear n
#> (dbl) (int)
#> 1 3 1
#> 2 4 8
#> 3 5 2
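If the separate data frames are only needed so they can be looped over later (for example one table per tab), a named list may be easier to handle than many variables created with assign(). This is an alternative technique rather than the approach above; a minimal sketch, with by_cyl as a made-up name:

```r
library(dplyr)

# Split mtcars by cyl, then summarise each piece; the list names are the cyl values
by_cyl <- split(mtcars, mtcars$cyl) %>%
  lapply(function(d) count(d, gear))

names(by_cyl)
by_cyl[["4"]]  # same table as gears_cars_cylinders_4
```

The list can then be iterated with lapply() or purrr::map() without looking up objects by constructed names.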