How to pass a single row for a function using dplyr - r

I am trying to apply a custom function to a data.frame row by row, but I can't figure out how to apply the function row by row. I'm trying rowwise() as in the simple artificial example below:
library(tidyverse)
my_fun <- function(df, col_1, col_2){
df[,col_1] + df[,col_2]
}
dff <- data.frame("a" = 1:10, "b" = 1:10)
dff %>%
rowwise() %>%
mutate(res = my_fun(., "a", "b"))
However, the data does not get passed row by row. How can I achieve that?

Under rowwise(), dplyr exposes the current row through the .data pronoun, which behaves like a list, so you need to index it with [[ rather than [, ]. You also need to pass .data rather than ., because . is the entire dff rather than the individual row.
my_fun <- function(df, col_1, col_2){
df[[col_1]] + df[[col_2]]
}
dff %>%
rowwise() %>%
mutate(res = my_fun(.data, 'a', 'b'))
You can see what .data looks like with the code below
dff %>%
rowwise() %>%
do(res = .data) %>%
.[[1]] %>%
head(1)
# [[1]]
# [[1]]$a
# [1] 1
#
# [[1]]$b
# [1] 1
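
When the per-row computation only needs a few known columns, a custom function may not be necessary at all. The following is a minimal sketch, assuming dplyr 1.0.0 or later (where c_across() is available); it is an alternative illustration, not part of the answer above.

```r
library(dplyr)

dff <- data.frame(a = 1:10, b = 1:10)

# c_across() collects the selected columns of the *current* row into a vector,
# so sum() is applied once per row rather than once over the whole column.
res_across <- dff %>%
  rowwise() %>%
  mutate(res = sum(c_across(c(a, b)))) %>%
  ungroup()

res_across$res
```

For a simple sum like this, the vectorised mutate(res = a + b) without rowwise() gives the same values; rowwise() matters when the function is not vectorised.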

Related

Creating a loop for filter() and group_by() from dplyr

In my toy data below, I'm repeating group_by() and filter() for variables: sample, group, and outcome (but not time).
I wonder if there is a functional solution such that we can provide the names of any number of variables that we want to group_by() and filter() in a loop-wise fashion inside a function like foo() shown below?
library(tidyverse)
data <- expand_grid(study=1:3,sample=1:2,group=1:3,outcome=c("A","B"),time=0:2)
get_rows <- function(x) { # Helper function used in `filter()`
u <- unique(x)
n <- sample(c(if(is.character(x)) 0 else min(u)-1, u), 1)
if(n == n[1]) TRUE else x == n
}
DF <- data %>%
group_by(study) %>%
filter(get_rows(sample)) %>% # for sample
ungroup()
DF2 <- DF %>%
group_by(study) %>%
filter(get_rows(group)) %>% # for group
ungroup()
DF3 <- DF2 %>%
group_by(study) %>%
filter(get_rows(outcome)) %>% # for outcome
ungroup()
#============================================ HOW TO LOOP ABOVE IN `foo()` BELOW?
foo <- function(data, ..., exclude_vars = c("time")){
## SOLUTION
}
You can loop over names of variables in strings if you use the dplyr .data pronoun. For example:
foo <- function(data, exclude_vars = c("time", "study")){
vars <- setdiff(names(data), exclude_vars)
for (var in vars) {
data <- data %>%
group_by(study) %>%
filter(get_rows(.data[[var]])) %>%
ungroup()
}
data
}
foo(data)
If you prefer, you could use purrr::reduce rather than the loop:
foo <- function(data, exclude_vars = c("time", "study")){
vars <- setdiff(names(data), exclude_vars)
cleanFn <- function(data, var) data %>%
group_by(study) %>%
filter(get_rows(.data[[var]])) %>%
ungroup()
reduce(vars, cleanFn, .init=data)
}
foo(data)
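
Because get_rows() samples at random, the result of foo() changes from run to run. To see the .data[[var]] loop in isolation, here is a small deterministic sketch; the toy data and the keep_positive() helper are hypothetical names made up for this illustration.

```r
library(dplyr)

# Hypothetical toy data; the column names are invented for this example.
toy <- tibble(
  id = 1:6,
  p  = c(1, -1, 2, 3, -2, 4),
  q  = c(5, 6, -7, 8, 9, 10)
)

# Keep only the rows where every column named in `vars` is positive.
# .data[[var]] looks up the column whose name is stored in the string `var`.
keep_positive <- function(data, vars) {
  for (var in vars) {
    data <- data %>% filter(.data[[var]] > 0)
  }
  data
}

kept <- keep_positive(toy, c("p", "q"))
kept$id
```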

Arguments aren't passed on correctly to a function in R

I've written my first function, but it's not working. When I run it, I get the error: Error: Column var1 is unknown
Edit: the code below is part of a bigger chunk of code that also produces the graph, but that part works.
Code:
# Creating dummydata
a <- sample(letters[1:5], 500, rep = TRUE)
b <- sample(1:10, 500, rep = TRUE)
df1 <- data.frame(a, b)
create_barchart <- function(data, var1, var2) {
# Creating summary statistics
df <- data %>%
group_by(var1, var2) %>%
summarise(n=n()) %>%
group_by(var1) %>%
mutate(perc=100*n/sum(n))
}
create_barchart(df1, a, b)
Put {{...}} around var1 and var2 and remove df <- (when an assignment is the last expression, the function returns its value invisibly, so nothing is printed). I suggest using ungroup() to terminate each group_by().
Also note that count({{var1}}, {{var2}}) could be used in place of group_by({{var1}}, {{var2}}) %>% summarize(n = n()) %>% ungroup().
library(dplyr)
create_barchart <- function(data, var1, var2) {
# Creating summary statistics
data %>%
group_by({{var1}}, {{var2}}) %>%
summarise(n=n()) %>%
ungroup %>%
group_by({{var1}}) %>%
mutate(perc=100*n/sum(n)) %>%
ungroup
}
create_barchart(df1, a, b)
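
The count() variant mentioned above could look roughly like this. It is a sketch on a tiny hypothetical data set, so the percentages are easy to check by hand; create_summary() and toy are names invented for the example.

```r
library(dplyr)

# Hypothetical small data set, chosen so the result can be verified by eye.
toy <- data.frame(
  a = c("x", "x", "x", "y"),
  b = c(1, 1, 2, 1)
)

# count() replaces group_by() + summarise(n = n()) and returns ungrouped data.
create_summary <- function(data, var1, var2) {
  data %>%
    count({{ var1 }}, {{ var2 }}) %>%
    group_by({{ var1 }}) %>%
    mutate(perc = 100 * n / sum(n)) %>%
    ungroup()
}

out <- create_summary(toy, a, b)
out
```

Here group "x" has counts 2 and 1 (about 66.7% and 33.3%), and group "y" has a single combination at 100%.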
In addition to Grothendieck's answer, you can use the enquo() / !! (pronounced "bang-bang") pair:
create_barchart <- function(data, var1, var2) {
var1 <- enquo(var1)
var2 <- enquo(var2)
# Creating summary statistics
df <- data %>%
group_by(!!var1, !!var2) %>%
summarise(n=n()) %>%
group_by(!!var1) %>%
mutate(perc=100*n/sum(n))
return(df)
}
create_barchart(data = df1, var1 = a, var2 = b)
For a more in depth explanation you can also see this blog post.

faster way to make new variables containing data frames to be rbinded

I want to make a bunch of new variables a,b,c,d.....z to store tibble data frames. I will then rbind the new variables that store tibble data frames and export them as a csv. How do I do this faster without having to specify the new variables each time?
a<- subset(data.frame, variable1="condition1",....,) %>% group_by() %>% summarize( a=mean())
b<-subset(data.frame, variable1="condition2",....,) %>% group_by() %>% summarize( a=mean())
....
z<-subset(data.frame, variable1="condition2",....,) %>% group_by() %>% summarize( a=mean())
rbind(a,b,....,z)
There's got to be a faster way to do this. My data set is large so having it stored in memory as partitions of a,b,c,....z is causing the computer to crash. Typing the subset conditions to form the partitions repeatedly is tedious.
You could do something like this using the purrr package.
Depending on what your conditions look like, you may need non-standard evaluation (NSE); see "Programming with dplyr" for reference.
purrr::map_df(
c("condition1","condition2",..., "conditionn"),
# .x for each condition
~ subset(your_data_frame, variable1 == .x,....,) %>% group_by(some_columns) %>% summarise(a = mean(some_columns))
)
Example using iris:
library(dplyr)
library(purrr)
library(rlang)
conditions <- c("Petal.Length>1.5","Species == 'setosa'","Sepal.Length > 5")
map(conditions, function(x){
iris %>%
dplyr::filter(!!rlang::parse_expr(x)) %>%
head()
})
Another example using iris, counting the matching rows:
conditions <- c("Petal.Length>1.5","Species == 'setosa'","Sepal.Length > 5")
map(conditions, ~ iris %>% dplyr::filter(!!rlang::parse_expr(.x)) %>% nrow())
# or (!! is almost equivalent to eval or rlang::eval_tidy())
map(conditions, ~ iris %>% dplyr::filter(eval(rlang::parse_expr(.x))) %>% nrow())
# [[1]]
# [1] 113
# [[2]]
# [1] 50
# [[3]]
# [1] 118
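
Since the goal in the question is a single combined table, the per-condition results can be bound together with a label column. The following is a sketch under assumptions: the condition names (long_petal, setosa, long_sepal) and the summary columns are made up for illustration.

```r
library(dplyr)
library(purrr)
library(rlang)

# Named character vector: the names become labels in the combined result.
conditions <- c(
  long_petal = "Petal.Length > 1.5",
  setosa     = "Species == 'setosa'",
  long_sepal = "Sepal.Length > 5"
)

# One small summary per condition; only these summaries are kept in memory.
summaries <- map(conditions, function(cond) {
  iris %>%
    filter(!!parse_expr(cond)) %>%
    summarise(n = n(), mean_sepal = mean(Sepal.Length))
})

# Stack the summaries; `.id` stores each list name in a `condition` column.
combined <- bind_rows(summaries, .id = "condition")
combined
```

The combined data frame can then be exported in one step, for example with write.csv().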
Instead of creating multiple objects in the global environment, read them into a list and bind it:
library(data.table)
files <- list.files(pattern = "\\.csv", full.names = TRUE)
rbindlist(lapply(files, fread))
Reading with fread is typically much faster than base alternatives such as read.csv.
If strings are being passed to group_by, convert each string to a symbol with sym from rlang and unquote it with !!:
library(purrr)
map2_df(c("condition1", "condition2"), c("a", "b"), ~ df1 %>%
group_by(!! rlang::sym(.x)) %>%
summarise(!! .y := mean(colname)))
If 'condition1', 'condition2', etc. are expressions, capture them as quosures and unquote them:
map2_df(quos(condition1, condition2), c("a", "b"), ~ df1 %>%
filter(!! .x) %>%
summarise(!! .y := mean(colnames)))
Using a reproducible example
conditions <- quos(Petal.Length>1.5,Species == 'setosa',Sepal.Length > 5)
map2(conditions, c('a', 'b', 'c'), ~
iris %>%
filter(!! .x) %>%
summarise(!! .y := mean(Sepal.Length)))
#[[1]]
# a
#1 6.124779
#[[2]]
# b
#1 5.006
#[[3]]
# c
#1 6.129661
It would be a 3 column dataset if we use map2_dfc
NOTE: It is not clear whether the OP meant 'condition1', 'condition2' as expressions to be passed on for filtering the rows or not.

Pass a data.frame column name to a function that uses purrr::map

I'm working with nested dataframes and want to pass the name of the top level dataframe, and the name of a column containing lower level dataframes, to a function that uses purrr::map to iterate over the lower level data frames.
Here's a toy example.
library(dplyr)
library(purrr)
library(tibble)
library(tidyr)
df1 <- tibble(x = c("a","b","c", "a","b","c"), y = 1:6)
df1 <- df1 %>%
group_by(x) %>%
nest()
testfunc1 <- function(df) {
df <- df %>%
mutate(out = map(data, min))
tibble(min1 = df$out)
}
testfunc2 <- function(df, col_name) {
df <- df %>%
mutate(out = map(col_name, min))
tibble(min2 = df$out)
}
df1 <- bind_cols(df1, testfunc1(df1))
df1 <- bind_cols(df1, testfunc2(df1, "data"))
df1$min1
df1$min2
testfunc1 behaves as expected, in this case giving the minimum of each data column in a new column. In testfunc2, where I've tried to pass the column name, a string reading "data" is passed to the new column. I think I understand from the thread here (Pass a data.frame column name to a function) why this doesn't behave as I want, but I haven't been able to figure out how to make it work in this case. Any suggestions would be great.
This should work for you; it uses the tidy eval framework and assumes col_name is a string.
testfunc2 <- function(df, col_name) {
df <- df %>%
mutate(out = map(!! rlang::sym(col_name), min))
tibble(min2 = df$out)
}
EDIT:
If you'd rather pass a bare column name to the function, instead of a string, use enquo instead of sym.
testfunc2 <- function(df, col_name) {
col_quo = enquo(col_name)
df <- df %>%
mutate(out = map(!! col_quo, min))
tibble(min2 = df$out)
}
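
For the string case, the .data pronoun is another option that avoids sym() and !!. The sketch below is an alternative illustration, not the answerer's code; min_by_name() is a hypothetical helper, and it assumes the nested data frames have a column called y, as in the toy example.

```r
library(dplyr)
library(purrr)
library(tidyr)

nested <- tibble(x = c("a", "b", "c", "a", "b", "c"), y = 1:6) %>%
  group_by(x) %>%
  nest() %>%
  ungroup()

# Hypothetical helper: `col_name` is a string naming the list-column.
# .data[[col_name]] returns that list-column, and map_dbl() walks over it.
min_by_name <- function(df, col_name) {
  df %>%
    mutate(out = map_dbl(.data[[col_name]], ~ min(.x$y)))
}

res <- min_by_name(nested, "data")
res$out
```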

Make a list element of each group with dplyr's group_by function

I would like to be able to use more automation when creating SpatialLines objects from otherwise tidy data frames.
library(sp)
#create sample data
sample_data <- data.frame(group_id = rep(c("a", "b","c"), 10),
x = rnorm(10),
y = rnorm(10))
#How can I recreate this using dplyr?
a_list <- Lines(list(Line(sample_data %>% filter(group_id == "a") %>% select(x, y))), ID = 1)
b_list <- Lines(list(Line(sample_data %>% filter(group_id == "b") %>% select(x, y))), ID = 2)
c_list <- Lines(list(Line(sample_data %>% filter(group_id == "c") %>% select(x, y))), ID = 3)
SpatialLines(list(a_list, b_list, c_list))
Something like group_by would make this process pretty easy, if I could figure out how to pipe each group's data into a list.
Using your sample data, a wrapper function, and dplyr::do will give you what you want :)
wrapper <- function(df) {
df %>% select(x,y) %>% as.data.frame %>% Line %>% list
}
y <- sample_data %>% group_by(group_id) %>%
do(res = wrapper(.))
# and now assign IDs (since we can't do that inside dplyr easily)
ids = 1:dim(y)[1]
SpatialLines(
mapply(x = y$res, ids = ids, FUN = function(x,ids) {Lines(x,ID=ids)})
)
I don't use sp so there might be a better way to assign IDs.
For reference, consider reading Hadley's comments on returning non-dataframe from dplyr do calls
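
Since do() is superseded in current dplyr, a base R sketch with split() and Map() may be worth considering. This is an assumption-laden illustration rather than the answerer's method: it uses the group name as the Lines ID, and the names pieces, lines_list and sl are invented for the example.

```r
library(sp)

set.seed(1)  # only so the toy data is reproducible
sample_data <- data.frame(group_id = rep(c("a", "b", "c"), 10),
                          x = rnorm(30),
                          y = rnorm(30))

# One data frame of coordinates per group, named by group_id.
pieces <- split(sample_data[, c("x", "y")], sample_data$group_id)

# Build one Lines object per group, using the group name as its ID.
lines_list <- Map(function(coords, id) {
  Lines(list(Line(as.matrix(coords))), ID = id)
}, pieces, names(pieces))

sl <- SpatialLines(unname(lines_list))
```

Using the group name as the ID keeps the link between each line and its original group, which the numeric IDs in the answer above do not.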
