dplyr group_by loop through different columns - r

I have the following data:
I would like to create three different data frames using the dplyr functions group_by and summarise. These would be df_Sex, df_AgeGroup and df_Type. For each of these columns I would like to perform the following operation:
df_Sex = df %>% group_by(Sex) %>% summarise(Total = sum(Number))
Is there a way of using apply or lapply to pass the names of these three columns (Sex, AgeGrouping and Type) to create these three data frames?

This will work, but it returns a list of data frames as its output:
### Create your data first
library(dplyr)
df <- data.frame(ID = rep(10250, 6), Sex = c(rep("Female", 3), rep("Male", 3)),
                 Population = c(rep(3499, 3), rep(1163, 3)), AgeGrouping = c(rep("0-14", 3), rep("15-25", 3)),
                 Type = c("Type1", "Type1", "Type2", "Type1", "Type1", "Type2"), Number = c(260, 100, 0, 122, 56, 0))
gr <- list("Sex", "AgeGrouping", "Type")
# across(all_of(i)) replaces the deprecated .dots argument (needs dplyr >= 1.0)
df_list <- lapply(gr, function(i) group_by(df, across(all_of(i))) %>% summarise(Total = sum(Number)))
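Since the question asks for df_Sex, df_AgeGrouping and df_Type specifically, here is a minimal sketch of the same idea that returns a named list, so each result can be looked up by name. The helper name summarise_by and the "df_" prefix are my own choices, and across(all_of()) assumes dplyr 1.0 or later.

```r
library(dplyr)

# Same toy data as in the answer above
df <- data.frame(
  ID = rep(10250, 6),
  Sex = c(rep("Female", 3), rep("Male", 3)),
  Population = c(rep(3499, 3), rep(1163, 3)),
  AgeGrouping = c(rep("0-14", 3), rep("15-25", 3)),
  Type = c("Type1", "Type1", "Type2", "Type1", "Type1", "Type2"),
  Number = c(260, 100, 0, 122, 56, 0)
)

group_cols <- c("Sex", "AgeGrouping", "Type")

# Hypothetical helper: group by one column given as a string, then sum Number
summarise_by <- function(data, col) {
  data %>%
    group_by(across(all_of(col))) %>%
    summarise(Total = sum(Number), .groups = "drop")
}

# lapply keeps the names of its input, so the list ends up named df_Sex, ...
df_list <- lapply(
  setNames(group_cols, paste0("df_", group_cols)),
  function(col) summarise_by(df, col)
)

df_list$df_Sex
```

Keeping the results in one named list is usually easier to work with than three free-standing objects; if separate objects are really needed, list2env(df_list, envir = globalenv()) would create them.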

Here's a way to do it:
library(dplyr)
f <- function(x) {
  df %>%
    group_by(!!x) %>%
    summarize(Total = sum(Number))
}
lapply(c(quo(Sex), quo(AgeGrouping), quo(Type)), f)
There might be a better way to do it; I haven't looked much into tidyeval. Personally, I would prefer this:
library(data.table)
DT <- as.data.table(df)
lapply(c("Sex", "AgeGrouping", "Type"),
       function(x) DT[, .(Total = sum(Number)), by = x])

Related

R - pipe multiple dataframes in a loop

I need to repeat a simple operation for over 50 dataframes, this calls for a loop, but I can't put together the right code.
I am creating a new dataframe with only 4 variables that are obtained by grouping and summarising with dplyr.
dataframes <- list(E5000, E5015, E5030, E5045, E5060, E5075, E5090)
E5000_stat <- E5000 %>%
  group_by(indeximage) %>%
  summarise(n_drop = n(), median_area = median(Area..mm.2..), tot_area = sum(Area..mm.2..))
I would like to have the same operation repeated in a loop for all the dataframes, so that I don't have to manually modify and re-run the same 4 lines of code 50 times.
Any help is highly appreciated.
You can use purrr::map or purrr::map_df (depending on whether you want the result to be a list or a single tibble):
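library(dplyr)
library(purrr)
# A pipeline that starts with `.` defines a reusable function
E_stat_func <- . %>%
  group_by(indeximage) %>%
  summarise(
    n_drop = n(),
    median_area = median(Area..mm.2..),
    tot_area = sum(Area..mm.2..)
  )
dataframes_summary <- dataframes %>%
  # map(E_stat_func)   # use this line instead to get a list
  map_df(E_stat_func)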
Use lapply or purrr::map -
library(dplyr)
apply_fun <- function(df) {
  df %>%
    group_by(indeximage) %>%
    summarise(n_drop = n(),
              median_area = median(Area..mm.2..),
              tot_area = sum(Area..mm.2..))
}
dataframes <- list(E5000, E5015, E5030, E5045, E5060, E5075, E5090)
out <- lapply(dataframes, apply_fun)
out
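Both answers above lose track of which summary came from which data frame. As a rough sketch, naming the list elements and binding with an id column keeps that information. The toy data below is a hypothetical stand-in for E5000, E5015 and so on (only the column names come from the question), and make_toy and summarise_drops are names I made up.

```r
library(dplyr)
library(purrr)

# Hypothetical stand-in data; the real E5000, E5015, ... are not shown in the question
make_toy <- function(scale) {
  data.frame(
    indeximage = rep(1:2, each = 3),
    Area..mm.2.. = scale * c(1, 2, 3, 4, 5, 6)
  )
}
dataframes <- list(E5000 = make_toy(1), E5015 = make_toy(2))

# Same summary as in the answers above
summarise_drops <- function(df) {
  df %>%
    group_by(indeximage) %>%
    summarise(
      n_drop = n(),
      median_area = median(Area..mm.2..),
      tot_area = sum(Area..mm.2..),
      .groups = "drop"
    )
}

# map() keeps the list names; bind_rows(.id =) turns them into a column
stats <- map(dataframes, summarise_drops)
all_stats <- bind_rows(stats, .id = "source")
all_stats
```

With 50 data frames, something like mget() on a vector of object names could build the named list, though reading the files straight into a list is usually tidier.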

How can I simultaneously assign value to multiple new columns with R and dplyr?

Given
base <- data.frame( a = 1)
f <- function() c(2,3,4)
I am looking for a solution that would result in a function f being applied to each row of base data frame and the result would be appended to each row. Neither of the following works:
result <- base %>% rowwise() %>% mutate( c(b,c,d) = f() )
result <- base %>% rowwise() %>% mutate( (b,c,d) = f() )
result <- base %>% rowwise() %>% mutate( b,c,d = f() )
What is the correct syntax for this task?
This appears to be a similar problem (Assign multiple new variables on LHS in a single line in R) but I am specifically interested in solving this with functions from tidyverse.
I think the best you can do is use do() to modify the data.frame. Perhaps:
base %>% do(cbind(., setNames(as.list(f()), c("b","c","d"))))
It would probably be best if f() returned a list in the first place, with one element per new column.
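For a tidyverse-only alternative to do(), one possible sketch is to store the named result of f() in a list column and then spread it into separate columns with tidyr::unnest_wider(). The two-row base and the column names b, c, d are assumptions for illustration; f() is the function from the question.

```r
library(dplyr)
library(tidyr)

# Two rows instead of one, to show that f() is applied per row
base <- data.frame(a = 1:2)
f <- function() c(2, 3, 4)

result <- base %>%
  rowwise() %>%
  # Store the named result of f() as one list element per row
  mutate(res = list(setNames(f(), c("b", "c", "d")))) %>%
  ungroup() %>%
  # Spread each named element of the list column into its own column
  unnest_wider(res)

result
```

If f() already returned a named vector or a list, the setNames() call could be dropped.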
In case you're willing to do this without dplyr:
# starting data frame
base_frame <- data.frame(col_a = 1:10, col_b = 10:19)
# the function you want applied to a given column
add_to <- function(x) { x + 100 }
# run this function on your base data frame, specifying the column you want to apply the function to:
add_computed_col <- function(frame, funct, col_choice) {
  frame[paste(floor(runif(1, min=0, max=10000)))] = lapply(frame[col_choice], funct)
  return(frame)
}
Usage:
df <- add_computed_col(base_frame, add_to, 'col_a')
head(df)
And add as many columns as needed:
df_b <- add_computed_col(df, add_to, 'col_b')
head(df_b)
Rename your columns.

Using mutate_at() on a nested dataframe column to generate multiple unnested columns

I'm experimenting with dplyr, tidyr and purrr. I have data like this:
library(tidyverse)
set.seed(123)
df <- data_frame(X1 = rep(LETTERS[1:4], 6),
                 X2 = sort(rep(1:6, 4)),
                 ref = sample(1:50, 24),
                 sampl1 = sample(1:50, 24),
                 var2 = sample(1:50, 24),
                 meas3 = sample(1:50, 24))
Now dplyr is awesome because I can do things like mutate_at() to manipulate multiple columns at once. e.g:
df <- df %>%
  mutate_at(vars(-one_of(c("X1", "X2", "ref"))), funs(first = . - ref)) %>%
  mutate_at(vars(contains("first")), funs(second = . * 2))
and tidyr allows me to nest subsets of the data as sub-tables in a single column:
df <- df %>% nest(-X1)
and thanks to purrr I can summarize these sub-tables while retaining the original data in the nested column:
df %>% mutate(mean = map_dbl(data, ~ mean(.x$meas3_first_second)))
How can I use purrr and mutate_at() to generate multiple summary columns (take the means of different (but not all) columns in each nested sub-table)?
In this example I'd like to take the mean of every column with the word "second" in it. I had hoped that this might produce a new nested column which I could then unnest(), but it does not work.
df %>% mutate(mean = map(data, ~ mutate_at(vars(contains("second")),
                                           funs(mean_comp_exp = mean(.)))))
How can I achieve this?
The comment by @aosmith was correct and helpful: the data has to be passed to the inner call as .x. In addition, I realised I needed to use summarise_at() and not mutate_at(), like so:
df %>%
  mutate(mean = map(data, ~ summarise_at(.x, vars(contains("second")),
                                         funs(mean_comp_exp = mean(.))))) %>%
  unnest(mean)
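funs() and the _at() variants have since been deprecated in dplyr, so here is a hedged sketch of roughly the same pipeline written with across(), assuming dplyr 1.0 or later and a recent tidyr. The .names patterns are chosen to reproduce the column names used above; the object names nested and result are mine.

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(123)
df <- tibble(
  X1 = rep(LETTERS[1:4], 6),
  X2 = sort(rep(1:6, 4)),
  ref = sample(1:50, 24),
  sampl1 = sample(1:50, 24),
  var2 = sample(1:50, 24),
  meas3 = sample(1:50, 24)
)

nested <- df %>%
  # Difference from ref for every measurement column
  mutate(across(-c(X1, X2, ref), ~ .x - ref, .names = "{.col}_first")) %>%
  # Double each of the new *_first columns
  mutate(across(contains("first"), ~ .x * 2, .names = "{.col}_second")) %>%
  nest(data = -X1)

result <- nested %>%
  # One-row summary per sub-table, then spread it into ordinary columns
  mutate(means = map(data, ~ summarise(
    .x,
    across(contains("second"), mean, .names = "{.col}_mean_comp_exp")
  ))) %>%
  unnest(means)

result
```

The nested data column is kept alongside the new summary columns, which matches what the summarise_at() version produces.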

Joining list of data.frames from map() call

Is there a "tidyverse" way to join a list of data.frames (a la full_join(), but for >2 data.frames)? I have a list of data.frames as a result of a call to map(). I've used Reduce() to do something like this before, but would like to merge them as part of a pipeline - just haven't found an elegant way to do that. Toy example:
library(tidyverse)
## Function to make a data.frame with an ID column and a random variable column with mean = df_mean
make.df <- function(df_mean){
  data.frame(id = 1:50,
             x = rnorm(n = 50, mean = df_mean))
}
## What I'd love:
my.dfs <- map(c(5, 10, 15), make.df) #%>%
# <<some magical function that will full_join() on a list of data frames?>>
## Gives me the result I want, but inelegant
my.dfs.joined <- full_join(my.dfs[[1]], my.dfs[[2]], by = 'id') %>%
  full_join(my.dfs[[3]], by = 'id')
## Kind of what I want, but I want to merge, not bind
my.dfs.bound <- map(c(5, 10, 15), make.df) %>%
  bind_cols()
We can use Reduce:
set.seed(1453)
r1 <- map(c(5, 10, 15), make.df) %>%
  Reduce(function(...) full_join(..., by = "id"), .)
Or this can be done with purrr's reduce:
library(purrr)
set.seed(1453)
r2 <- map(c(5, 10, 15), make.df) %>%
  reduce(full_join, by = "id")
identical(r1, r2)
#[1] TRUE
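One side effect of joining data frames that all share the column name x is that the joined columns end up with automatically disambiguated names (suffixes such as .x and .y). A small sketch of one way around this is to give each value column a distinct name before reducing. The names make_df, means and the "x_mean_" prefix are my own choices.

```r
library(dplyr)
library(purrr)

# Same toy generator as in the question, under a syntactically plainer name
make_df <- function(df_mean) {
  data.frame(id = 1:50, x = rnorm(n = 50, mean = df_mean))
}

means <- c(5, 10, 15)

set.seed(1453)
joined <- means %>%
  # Rename the value column per data frame so the join needs no suffixes
  map(function(m) setNames(make_df(m), c("id", paste0("x_mean_", m)))) %>%
  reduce(full_join, by = "id")

names(joined)
```

The result has one row per id and one clearly labelled column per input data frame.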

How to get the name of a data.frame within a list?

How can I get a data frame's name from a list? Sure, get() gets the object itself, but I want to have its name for use within another function. Here's the use case, in case you would rather suggest a work around:
lapply(somelistOfDataframes, function(X) {
  ddply(X, .(idx, bynameofX), summarise, checkSum = sum(value))
})
There is a column in each data frame that goes by the same name as the data frame within the list. How can I get this name bynameofX? names(X) would return the whole vector.
EDIT: Here's a reproducible example:
df1 <- data.frame(value = rnorm(100), cat = c(rep(1,50),
                  rep(2,50)), idx = rep(letters[1:4],25))
df2 <- data.frame(value = rnorm(100,8), cat2 = c(rep(1,50),
                  rep(2,50)), idx = rep(letters[1:4],25))
mylist <- list(cat = df1, cat2 = df2)
lapply(mylist, head, 5)
I'd use the names of the list in this fashion:
dat1 = data.frame()
dat2 = data.frame()
l = list(dat1 = dat1, dat2 = dat2)
> str(l)
List of 2
$ dat1:'data.frame': 0 obs. of 0 variables
$ dat2:'data.frame': 0 obs. of 0 variables
and then use lapply + ddply like:
lapply(names(l), function(x) {
  ddply(l[[x]], c("idx", x), summarise, checkSum = sum(value))
})
This remains untested without a reproducible example, but it should point you in the right direction.
EDIT (ran2): Here's the code using the reproducible example.
l <- lapply(names(mylist), function(x) {
  ddply(mylist[[x]], c("idx", x), summarise, checkSum = sum(value))
})
names(l) <- names(mylist); l
Here is the dplyr equivalent
library(dplyr)
catalog =
  data_frame(
    data = someListOfDataframes,
    cat = names(someListOfDataframes)) %>%
  rowwise %>%
  mutate(
    renamed =
      data %>%
      rename_(.dots =
                cat %>%
                as.name %>%
                list %>%
                setNames("cat")) %>%
      list)
catalog$renamed %>%
  bind_rows(.id = "number") %>%
  group_by(number, idx, cat) %>%
  summarize(checkSum = sum(value))
You could also first store the names with list_name <- names(list) and then use list_name[1], list_name[2], etc. to get each element's name. (You may also need as.numeric(list_name[x]) if your list names are numbers.)
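The "iterate over names" idea from the answers above can also be sketched with purrr::imap(), which hands each list element and its name to the function together. This is only an illustration using the reproducible example from the question; the seed, the result name check_sums, and the use of across(all_of()) (dplyr 1.0 or later) are my assumptions.

```r
library(dplyr)
library(purrr)

set.seed(1)
df1 <- data.frame(value = rnorm(100), cat = c(rep(1, 50), rep(2, 50)),
                  idx = rep(letters[1:4], 25))
df2 <- data.frame(value = rnorm(100, 8), cat2 = c(rep(1, 50), rep(2, 50)),
                  idx = rep(letters[1:4], 25))
mylist <- list(cat = df1, cat2 = df2)

# imap() passes the element as the first argument and its name as the second
check_sums <- imap(mylist, function(d, nm) {
  d %>%
    group_by(across(all_of(c("idx", nm)))) %>%
    summarise(checkSum = sum(value), .groups = "drop")
})

check_sums$cat
```

The output list keeps the names of mylist, so no separate names(l) <- names(mylist) step is needed.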

Resources