I want to make a bunch of new variables a,b,c,d.....z to store tibble data frames. I will then rbind the new variables that store tibble data frames and export them as a csv. How do I do this faster without having to specify the new variables each time?
a <- subset(data.frame, variable1 == "condition1", ....) %>% group_by() %>% summarize(a = mean())
b <- subset(data.frame, variable1 == "condition2", ....) %>% group_by() %>% summarize(a = mean())
....
z <- subset(data.frame, variable1 == "condition26", ....) %>% group_by() %>% summarize(a = mean())
rbind(a,b,....,z)
There's got to be a faster way to do this. My data set is large so having it stored in memory as partitions of a,b,c,....z is causing the computer to crash. Typing the subset conditions to form the partitions repeatedly is tedious.
You could do something like this using the purrr package:
You may need to use NSE, depending on what your conditions are. You can reference Programming with dplyr.
purrr::map_df(
c("condition1","condition2",..., "conditionn"),
# .x for each condition
~ subset(your_data_frame, variable1 == .x, ....) %>% group_by(some_columns) %>% summarise(a = mean(some_columns))
)
Example using iris:
library(purrr)
library(dplyr)
library(rlang)
conditions <- c("Petal.Length>1.5","Species == 'setosa'","Sepal.Length > 5")
map(conditions, function(x){
iris %>%
dplyr::filter(!!rlang::parse_expr(x)) %>%
head()
})
Another example using iris, this time counting the matching rows:
conditions <- c("Petal.Length>1.5","Species == 'setosa'","Sepal.Length > 5")
map(conditions, ~ iris %>% dplyr::filter(!!rlang::parse_expr(.x)) %>% nrow())
# or (!! is almost equivalent to eval or rlang::eval_tidy())
map(conditions, ~ iris %>% dplyr::filter(eval(rlang::parse_expr(.x))) %>% nrow())
[[1]]
[1] 113
[[2]]
[1] 50
[[3]]
[1] 118
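Putting this together for the full workflow in the question (filter by condition, summarise, row-bind, export to CSV), here is a minimal sketch using iris as a stand-in for the real data; the grouping column, summary column, and output file name are placeholders of my own, not the OP's:
library(dplyr)
library(purrr)
library(rlang)

conditions <- c("Petal.Length > 1.5", "Species == 'setosa'", "Sepal.Length > 5")

combined <- map_dfr(conditions, ~ iris %>%
  filter(!!parse_expr(.x)) %>%
  group_by(Species) %>%                  # stand-in for your grouping column(s)
  summarise(mean_sepal = mean(Sepal.Length)) %>%
  mutate(condition = .x))                # record which subset each row came from

write.csv(combined, "summaries.csv", row.names = FALSE)   # hypothetical file name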
Instead of creating multiple objects in the global environment, read them into a list and bind them:
library(data.table)
files <- list.files(pattern = "\\.csv", full.names = TRUE)
rbindlist(lapply(files, fread))
It would be much faster with fread than with any other option.
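If the goal is the OP's workflow of binding and then exporting, the combined table can be written back out with data.table's fwrite; a small sketch (the output file name is just an example):
combined <- rbindlist(lapply(files, fread))
fwrite(combined, "combined.csv")   # fwrite is data.table's fast CSV writer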
If we are using strings to be passed on to group_by, convert each string to a symbol with sym from rlang and evaluate it (!!):
library(purrr)
map2_df(c("condition1", "condition2"), c("a", "b"), ~ df1 %>%
group_by(!! rlang::sym(.x)) %>%
summarise(!! .y := mean(colname)))
If 'condition1', 'condition2', etc. are expressions, capture them as quosures and evaluate them:
map2_df(quos(condition1, condition2), c("a", "b"), ~ df1 %>%
filter(!! .x) %>%
summarise(!! .y := mean(colname)))
Using a reproducible example
conditions <- quos(Petal.Length>1.5,Species == 'setosa',Sepal.Length > 5)
map2(conditions, c('a', 'b', 'c'), ~
iris %>%
filter(!! .x) %>%
summarise(!! .y := mean(Sepal.Length)))
#[[1]]
# a
#1 6.124779
#[[2]]
# b
#1 5.006
#[[3]]
# c
#1 6.129661
It would be a 3-column dataset if we use map2_dfc.
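A minimal sketch of that map2_dfc variant, reusing the conditions quosures from the iris example above (the values are the same as in the list output):
map2_dfc(conditions, c('a', 'b', 'c'), ~
  iris %>%
    filter(!! .x) %>%
    summarise(!! .y := mean(Sepal.Length)))
# a single row with columns a, b and c: 6.124779, 5.006, 6.129661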
NOTE: It is not clear whether the OP meant 'condition1', 'condition2' as expressions to be passed on for filtering the rows or not.
I am finally trying to make the jump from for-loops to purrr. The basic examples are intuitive enough, but translating for-loops has been tricky, especially when trying to assign output objects to lists.
Consider the following example where I am trying to assign the output of a nested for-loop to a named list element.
library(tidyverse)
# output list
loop_list <- list()
# iterate over colors
for (color in diamonds %>% distinct(color) %>% pull()){
# sublist for each color
loop_list[[color]] <- list()
# iterate over cuts
for(cut in diamonds %>% distinct(cut) %>% pull()){
# filter data
data <- diamonds %>%
filter(color == color & cut == cut)
# define output
out <- data %>%
pull(price) %>%
mean()
# assign output to sublist of its color
loop_list[[color]][[cut]] <- out
# clean up filtered data set
rm(data)
}
}
This nested loop assigns the output object to its properly named sublist for each color of the data set. My purrr attempt creates something similar, but without the named sublists. All output objects are assigned to the same list, which is not what I'd ideally like.
grid <- expand_grid(color = diamonds %>% distinct(color) %>% pull(),
cut = diamonds %>% distinct(cut) %>% pull())
myfunc <- function(data, color, cut){
# create output object
out <- data %>%
# filter data
filter(color == color & cut == cut) %>%
pull(price) %>%
mean()
# return output
return(out)
}
purrr_list <- grid %>%
pmap(myfunc, data = diamonds)
Is there a way to arrive at the same output with purrr? I am aware that global assignment with <<- is a possibility, but this is generally discouraged, from what I understand.
It is better to change the argument names to avoid a clash with the actual column names.
myfunc <- function(data, col1, col2){
# filter data
data <- data %>%
filter(color == col1 & cut == col2)
# define output
out <- data %>%
pull(price) %>%
mean()
# return output
return(out)
}
grid %>%
pmap_dbl(~ myfunc(diamonds, ..1, ..2))
#[1] 2597.550 3538.914 3423.644 3214.652 3682.312 4451.970 5946.181 5078.533 5255.880 4685.446 4918.186 6294.592 4574.173 5103.513 4975.655
#[16] 3889.335 5216.707 4276.255 4535.390 5135.683 3374.939 4324.890 3495.750 3778.820 3827.003 3720.706 4500.742 4123.482 3872.754 4239.255
#[31] 2629.095 3631.293 3405.382 3470.467 4291.061
If we need a nested output
library(dplyr)
grid %>%
split(.$color) %>%
map(~ pmap(.x, ~ myfunc(diamonds, ..1, ..2)) %>%
setNames(.x$cut))
Output:
#$D
#$D$Ideal
#[1] 2629.095
#$D$Premium
#[1] 3631.293
#$D$Good
#[1] 3405.382
#$D$`Very Good`
#[1] 3470.467
#$D$Fair
#[1] 4291.061
#$E
#$E$Ideal
#[1] 2597.55
#$E$Premium
#[1] 3538.914
#$E$Good
#[1] 3423.644
# ..
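For comparison, here is a sketch (my own assumption, not part of the answer above) that mirrors the nested for-loop more directly with two nested map calls plus set_names, reusing myfunc as defined above:
library(purrr)

colors <- diamonds %>% distinct(color) %>% pull() %>% as.character()
cuts <- diamonds %>% distinct(cut) %>% pull() %>% as.character()

# outer list named by color, inner lists named by cut
nested_list <- map(set_names(colors), function(col1) {
  map(set_names(cuts), function(col2) {
    myfunc(diamonds, col1, col2)
  })
})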
I have a list of data frames with inconsistent but overlapping variables. Some of the shared variables have similar but not identical names. I would like to conditionally rename the variable so that it is consistent across datasets. The way to do this one at a time would be
library(tidyverse)
df_1 <- starwars
df_2 <- starwars %>% rename(haircolor = hair_color)
df_3 <- starwars
df_list <- list(df_1, df_2, df_3)
df_list[[2]] <- df_list[[2]] %>% rename(hair_color = haircolor)
But I would like this to be flexible such that I can just feed in a list of any size and it will rename any variable titled hair_color as haircolor. Is there a way to purrr::map over these in a way that renames conditionally on the variable existing? The most basic interpretation would look something like:
df_list %>%
purrr::map( ~ rename(., hair_color = haircolor))
We can do this with a select helper (matches) inside rename_at:
library(dplyr)
library(purrr)
df_list %>%
purrr::map( ~ .x %>%
rename_at(vars(matches('hair_color')), ~ 'haircolor'))
Or use an if/else condition
df_list %>%
purrr::map( ~ if('hair_color' %in% names(.)) {
rename(., haircolor = hair_color)
} else .)
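As a side note, rename_at is superseded in newer dplyr; assuming dplyr 1.0 or later is available, the same idea can be sketched with rename_with (matches() only selects columns that actually exist, so data frames without hair_color pass through unchanged):
df_list %>%
  purrr::map(~ rename_with(.x, function(nm) sub('hair_color', 'haircolor', nm, fixed = TRUE),
                           matches('hair_color')))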
I am trying to apply a custom function to a data.frame row by row, but I can't figure out how to do it. I'm trying rowwise() as in the simple artificial example below:
library(tidyverse)
my_fun <- function(df, col_1, col_2){
df[,col_1] + df[,col_2]
}
dff <- data.frame("a" = 1:10, "b" = 1:10)
dff %>%
rowwise() %>%
mutate(res = my_fun(., "a", "b"))
However, the data does not get passed row by row. How can I achieve that?
dplyr's rowwise() presents the row output (.data) as a list, so you need to use [[ to extract the elements. You also need to use .data rather than ., because . is the entire dff rather than the individual row.
my_fun <- function(df, col_1, col_2){
df[[col_1]] + df[[col_2]]
}
dff %>%
rowwise() %>%
mutate(res = my_fun(.data, 'a', 'b'))
You can see what .data looks like with the code below
dff %>%
rowwise() %>%
do(res = .data) %>%
.[[1]] %>%
head(1)
# [[1]]
# [[1]]$a
# [1] 1
#
# [[1]]$b
# [1] 1
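As an aside, assuming dplyr 1.0+ is available, cur_data() (or pick() in dplyr 1.1+) hands the current row to the function directly, so the .data trick is not needed; a sketch:
library(dplyr)

dff %>%
  rowwise() %>%
  mutate(res = my_fun(cur_data(), "a", "b")) %>%   # in dplyr 1.1+, pick(everything()) does the same
  ungroup()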
I am trying to use a custom function inside a piped mutate statement. I looked at this somewhat similar SO post but in vain.
Say I have a data frame like this (where blob is some variable not related to the specific task but is part of the entire data):
df <-
data.frame(exclude=c('B','B','D'),
B=c(1,0,0),
C=c(3,4,9),
D=c(1,1,0),
blob=c('fd', 'fs', 'sa'),
stringsAsFactors = F)
I have a function that uses the variable names to select columns based on the value in the exclude column and, for example, calculates a sum over the variables not specified in exclude (which is always a single character).
FUN <- function(df){
sum(df[c('B', 'C', 'D')] [!names(df[c('B', 'C', 'D')]) %in% df['exclude']] )
}
When I give a single row (row 1) to FUN I get the expected sum of C and D (those not mentioned by exclude), namely 4:
FUN(df[1,])
How do I do the same in a pipe with mutate (adding the result to a variable s)? These two tries do not work:
df %>% mutate(s=FUN(.))
df %>% group_by(1:n()) %>% mutate(s=FUN(.))
UPDATE
This also does not work as intended:
df %>% rowwise(.) %>% mutate(s=FUN(.))
This works, of course, but is not within dplyr's mutate (and pipes):
df$s <- sapply(1:nrow(df), function(x) FUN(df[x,]))
If you want to use dplyr you can do so using rowwise and your function FUN.
df %>%
rowwise %>%
do({
result = as_data_frame(.)
result$s = FUN(result)
result
})
The same can be achieved using group_by instead of rowwise (like you already tried) but with do instead of mutate
df %>%
group_by(1:n()) %>%
do({
result = as_data_frame(.)
result$s = FUN(result)
result
})
The reason mutate doesn't work in this case is that you are passing the whole tibble to it, so it's like calling FUN(df).
A much more efficient way of doing the same thing though is to just make a matrix of columns to be included and then use rowSums.
cols <- c('B', 'C', 'D')
include_mat <- outer(function(x, y) x != y, X = df$exclude, Y = cols)
# or outer(`!=`, X = df$exclude, Y = cols) if it's more readable to you
df$s <- rowSums(df[cols] * include_mat)
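For the example df, a quick check confirms this gives the same s column as the row-by-row approaches (the values match the output shown further down):
df$s
# [1] 4 5 9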
purrr approach
We can use a combination of nest and map_dbl for this:
library(tidyverse)
df %>%
rowwise %>%
nest(-blob) %>%
mutate(s = map_dbl(data, FUN)) %>%
unnest
Let's break that down a little bit. First, rowwise ensures that the subsequent operations are applied to each row individually, which supports arbitrarily complex per-row operations.
Next, nest will create a new column that is a list of our data to be fed into FUN (the beauty of tibbles vs data.frames!). Since we are applying this rowwise, each row contains a single-row tibble of exclude:D.
Finally, we use map_dbl to map our FUN to each of these tibbles. map_dbl is used over the family of other map_* functions since our intended output is numeric (i.e. double).
unnest then returns our tibble to the more standard structure.
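Note that nest(-blob) and the bare unnest are the older tidyr interface; with tidyr 1.0+ the same idea can be sketched as follows (my assumption, not the original answer; it relies on blob being unique per row so each nested tibble has exactly one row):
library(tidyverse)

df %>%
  nest(data = -blob) %>%                # one nested tibble per blob value
  mutate(s = map_dbl(data, FUN)) %>%    # apply FUN to each one-row tibble
  unnest(data)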
purrrlyr approach
While purrrlyr may not be as 'popular' as its parents dplyr and purrr, its by_row function has some utility here.
In your above example, we would use your data frame df and user-defined function FUN in the following way:
library(purrrlyr)

df %>%
  by_row(..f = FUN, .to = "s", .collate = "cols")
That's it! Giving you:
# tibble [3 x 6]
exclude B C D blob s
<chr> <dbl> <dbl> <dbl> <chr> <dbl>
1 B 1 3 1 fd 4
2 B 0 4 1 fs 5
3 D 0 9 0 sa 9
Admittedly, the syntax is a little strange, but here's how it breaks down:
..f = the function to apply to each row
.to = the name of the output column, in this case s
.collate = the way the results should be collated, by list, row, or column. Since FUN only has a single output, we would be fine to use either "cols" or "rows"
See here for more information on using purrrlyr...
Performance
Fair warning: while I like the functionality of by_row, it's not always the best approach for performance! purrr is more intuitive, but comes with a rather large speed loss. See the following microbenchmark test:
library(microbenchmark)
library(ggplot2) # needed for autoplot()
mbm <- microbenchmark(
purrr.test = df %>% rowwise %>% nest(-blob) %>%
mutate(s = map_dbl(data, FUN)) %>% unnest,
purrrlyr.test = df %>% by_row(..f = FUN, .to = "s", .collate = "cols"),
rowwise.test = df %>%
rowwise %>%
do({
result = as_tibble(.)
result$s = FUN(result)
result
}),
group_by.test = df %>%
group_by(1:n()) %>%
do({
result = as_tibble(.)
result$s = FUN(result)
result
}),
sapply.test = {df$s <- sapply(1:nrow(df), function(x) FUN(df[x,]))},
times = 1000
)
autoplot(mbm)
You can see that the purrrlyr approach is faster than the approach of using a combination of do with rowwise or group_by(1:n()) (see #konvas answer), and rather on par with the sapply approach. However, the package is admittedly not the most intuitive. The standard purrr approach seems to be the slowest, but also perhaps easier to work with. Different user-defined functions may change the speed order.
I am wondering how to tidy the following:
First I gather a selection of columns into a tibble with three columns: strain (the grouping factor), params (the parameter names), and values (the actual values):
sel <- t_tcellact %>% select(strain, contains("nbr_")) %>% gather(params, values, nbr_DP:nbr_CD3p)
Then I perform multiple pairwise.t.test() calls:
test2 <- sel %>%
  bind_rows(sel) %>%
  split(.$params) %>%
  map(~ pairwise.t.test(x = .$values, g = .$strain, p.adj = "none"))
And the result is a list of the results from the pairwise.t.tests, which I can start cleaning with broom's tidy():
test3 <- lapply(test2, tidy)
The list now looks like this:
$nbr_CD3p
group1 group2 p.value
1 SKG Balb/c 0.000001849548
$nbr_DN_CD69nCD25n
group1 group2 p.value
1 SKG Balb/c 0.6295371
and so on....
From this I need a tibble with the following columns: parameter (e.g. nbr_CD3p), group1, group2, p.value.
In this example I had only two groups, but I want to do it in a generic way also applicable when I have multiple groups.
Does anybody have an idea how to get to this point in an elegant way (without a loop)?
You should be able to use bind_rows(), taking advantage of the .id argument:
test3 <- lapply(test2, tidy) %>%
bind_rows(.id = 'parameter')
That will use the names of test2 as a new column named parameter in the data frame. All that said, replacing lapply with map_df() as aosmith suggested in a comment above should also work.
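A quick sketch of that map_df() variant (assuming broom is loaded for tidy()):
library(purrr)
library(broom)

test3 <- map_df(test2, tidy, .id = "parameter")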
I found a way to do that:
test2 <- sel %>%
  bind_rows(sel) %>%
  split(.$params) %>%
  map(~ pairwise.t.test(x = .$values, g = .$strain, p.adj = "none")) %>%
  lapply(tidy) %>%
  do.call("rbind", .) %>%
  mutate(params = rownames(.)) %>%
  as_tibble()