Joining list of data.frames from map() call - r

Is there a "tidyverse" way to join a list of data.frames (a la full_join(), but for >2 data.frames)? I have a list of data.frames as a result of a call to map(). I've used Reduce() to do something like this before, but would like to merge them as part of a pipeline - just haven't found an elegant way to do that. Toy example:
library(tidyverse)
## Function to make a data.frame with an ID column and a random variable column with mean = df_mean
make.df <- function(df_mean){
data.frame(id = 1:50,
x = rnorm(n = 50, mean = df_mean))
}
## What I'd love:
my.dfs <- map(c(5, 10, 15), make.df) #%>%
# <<some magical function that will full_join() on a list of data frames?>>
## Gives me the result I want, but inelegant
my.dfs.joined <- full_join(my.dfs[[1]], my.dfs[[2]], by = 'id') %>%
full_join(my.dfs[[3]], by = 'id')
## Kind of what I want, but I want to merge, not bind
my.dfs.bound <- map(c(5, 10, 15), make.df) %>%
bind_cols()

We can use Reduce
set.seed(1453)
r1 <- map(c(5, 10, 15), make.df) %>%
Reduce(function(...) full_join(..., by = "id"), .)
Or this can be done with reduce
library(purrr)
set.seed(1453)
r2 <- map(c(5, 10, 15), make.df) %>%
reduce(full_join, by = "id")
identical(r1, r2)
#[1] TRUE
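One thing to watch: every data frame in the list has a column called x, so full_join() has to disambiguate the duplicates with suffixes, and the joined result ends up with columns such as x.x and x.y. A minimal sketch of one way around that, assuming it is acceptable to rename x before joining (the x_mean prefix is a made-up naming scheme, not part of the original answer):

```r
library(dplyr)
library(purrr)

means <- c(5, 10, 15)
set.seed(1453)

joined <- means %>%
  map(function(m) {
    # Build one data frame per mean and name its value column after that mean
    # ("x_mean" is a hypothetical prefix chosen for this illustration)
    data.frame(id = 1:50, x = rnorm(n = 50, mean = m)) %>%
      setNames(c("id", paste0("x_mean", m)))
  }) %>%
  reduce(full_join, by = "id")

names(joined)
```

With unique column names the join needs no suffixes, and each column can be traced back to the input that produced it.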

Related

Stack a dataframe on itself multiple times in R?

I would like to stack a dataframe 100 times on itself, with an additional column indicating the iteration number, similar to what dplyr::bind_rows(..., .id = "id") does. Or is there a way I can save 100 times my dataframe into a single list and then use data.table::rbindlist()?
library(dplyr)
bind_rows(iris, iris, .id = "id") #This stacks the data only twice
library(data.table)
rbindlist(list(iris, iris), idcol = "id")
We could use replicate to return the datasets in a list and then use bind_rows or rbindlist
library(dplyr)
n <- 5
replicate(n, iris, simplify = FALSE) %>%
bind_rows(.id = 'id')
Another option is purrr::rerun (note that rerun() is deprecated as of purrr 1.0.0)
library(purrr)
n %>%
rerun(iris) %>%
bind_rows(.id = 'id')
A base R option with rbind + lapply + cbind
n <- 100
do.call(rbind, lapply(seq(n), function(k) cbind(id = k, iris)))
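Since rerun() is deprecated in recent purrr releases, the same idea can be sketched with map() over an index vector. This is only an illustration, assuming dplyr and purrr are installed; the name stacked is made up here:

```r
library(dplyr)
library(purrr)

n <- 5

# map() returns a list holding n copies of iris;
# bind_rows() stacks them and records the list position in the "id" column
stacked <- map(seq_len(n), function(i) iris) %>%
  bind_rows(.id = "id")

nrow(stacked)
```

Setting n <- 100 gives the 100 stacked copies asked for in the question.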

dplyr group_by loop through different columns

I have the following data;
I would like to create three different data frames using the dplyr functions group_by and summarise. These would be df_Sex, df_AgeGroup and df_Type. For each of these columns I would like to run the following:
df_Sex = df%>%group_by(Sex)%>%summarise(Total = sum(Number))
Is there a way of using apply or lapply to pass the names of each of these three columns (Sex, AgeGrouping and Type) and create these three data frames?
This will work, but it returns a list of data frames as its output:
### Create your data first
df <- data.frame(ID = rep(10250,6), Sex = c(rep("Female", 3), rep("Male",3)),
Population = c(rep(3499, 3), rep(1163,3)), AgeGrouping =c(rep("0-14", 3), rep("15-25",3)) ,
Type = c("Type1", "Type1","Type2", "Type1","Type1","Type2"), Number = c(260,100,0,122,56,0))
gr <- list("Sex", "AgeGrouping","Type")
df_list <- lapply(gr, function(i) group_by(df, .dots=i) %>%summarise(Total = sum(Number)))
Here's a way to do it:
f <- function(x) {
df %>%
group_by(!!x) %>%
summarize(Total = sum(Number))
}
lapply(c(quo(Sex), quo(AgeGrouping), quo(Type)), f)
There might be a better way to do it; I haven't looked much into tidyeval. Personally, I would prefer this:
library(data.table)
DT <- as.data.table(df)
lapply(c("Sex", "AgeGrouping", "Type"),
function(x) DT[, .(Total = sum(Number)), by = x])
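For readers on a recent dplyr, here is a hedged sketch of the same loop that passes the column names as plain strings. It assumes dplyr 1.0.0 or later for across() and all_of(); the names cols and summaries are made up for this example:

```r
library(dplyr)

df <- data.frame(
  ID = rep(10250, 6),
  Sex = c(rep("Female", 3), rep("Male", 3)),
  Population = c(rep(3499, 3), rep(1163, 3)),
  AgeGrouping = c(rep("0-14", 3), rep("15-25", 3)),
  Type = c("Type1", "Type1", "Type2", "Type1", "Type1", "Type2"),
  Number = c(260, 100, 0, 122, 56, 0)
)

cols <- c("Sex", "AgeGrouping", "Type")

# setNames() names the input vector, so the resulting list is named by column
summaries <- lapply(setNames(cols, cols), function(col) {
  df %>%
    group_by(across(all_of(col))) %>%
    summarise(Total = sum(Number))
})

summaries$Sex
```

Because the list is named, summaries$Sex, summaries$AgeGrouping and summaries$Type play the roles of df_Sex, df_AgeGroup and df_Type without creating three separate objects.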

r map_df to invoke_map translation

I would like to apply multiple functions over multiple columns in a data frame. I've worked out how to apply one function to all the columns in a data frame, but I'm stumped trying to use invoke_map to apply a list of functions. I've played around with quo and enquo, but I don't have the right combination (or grasp yet, I guess).
Toy example set up:
library(tidyverse)
library(RcppRoll)
library(purrr)
ID <- letters[1:26]
var1 <- sample(1:100, 26, replace= T)
var2 <- sample(100:200, 26, replace= T)
temp <- cbind(ID, var1, var2) %>% data.frame()
This works to apply one function:
roll.var <- function(name) {
label <- enquo(name)
map_df(temp[, 2:3], ~ name(.x, 5, fill= NA)) %>%
rename_all(funs(paste0(., '.', (!!label)))) %>%
cbind(temp, .)
}
test <- roll.var(roll_sdr)
Here's my attempt to use invoke_map to apply a list of functions to the chosen columns:
roll.func <- c("roll_sdr", "roll_varr")
invoke_map(roll.var, .x= roll.func)
And it returns: Error in name(.x, 51, fill = NA) : could not find function "name"
The second issue is that in the 'test' data frame produced by the first example, the first variable is named incorrectly (var1.~), whereas the second is named as I expected (var2.roll_sdr). Can anyone tell me why?
Any solution and/or education would be much appreciated!
EDIT:
Incorporating Mike's explanation that invoke_map needs a list of lists, the complete code to produce what I want is:
library(tidyverse)
library(purrr)
library(RcppRoll)
library(plyr)
options(stringsAsFactors= F)
ID <- letters[1:26] %>% data.frame(ID= .)
var1 <- sample(1:100, 26, replace= T) %>% data.frame(var1= .)
var2 <- sample(100:200, 26, replace= T) %>% data.frame(var2= .)
temp <- bind_cols(ID, var1, var2)
roll.func <- list(list(roll_sdr, 'roll_sdr'),
list(roll_varr, 'roll_varr'))
roll.var <- function(name, vname) {
map_df(temp[, 2:3], ~ name(.x, 5, fill= NA)) %>%
rename_all(funs(paste0(., '.', vname))) %>%
cbind(temp, .)
}
df <- invoke_map(roll.var, roll.func)
## plyr statement works much faster than purrr::reduce
df2 <- join_all(df, by= c('ID', 'var1', 'var2'))
Is it possible to add a statement in the roll.var function so that the vname doesn't have to be repeated in roll.func? Somehow quote the name once inside the function? I've played around with enquo and the rlang package, but I haven't found the right combination.
roll.func <- list(list(roll_sdr),
list(roll_varr))
Ideally, the definition above would work both as a function call and for appending the label to the variable name.
There are two problems with this.
The first problem is with the construct map_df(temp[, 2:3], ~ name(.x, 5, fill= NA)) - this doesn't work because it doesn't know what name is referring to. You will find it much easier in these types of cases to just pass the function object, and not the name of the function - that is, don't put it in quotes.
The second problem is that your construct roll.func isn't correct. Read the docs for invoke_map carefully - that argument must be a list. Each element of the list must be a list, the elements of which will be passed as arguments to the function. So, this simple example works:
library(purrr)
var1 <- sample(1:100, 26, replace= T) %>% as.numeric
var2 <- sample(100:200, 26, replace= T) %>% as.numeric
temp <- cbind(var1, var2) %>% data.frame()
simple_example <- function(func) map(temp, func)
roll.func <- list(
list(mean),
list(sum)
)
invoke_map(simple_example, roll.func)
#> [[1]]
#> [[1]]$var1
#> [1] 53.42308
#>
#> [[1]]$var2
#> [1] 140.6154
#>
#>
#> [[2]]
#> [[2]]$var1
#> [1] 1389
#>
#> [[2]]$var2
#> [1] 3656
and you should be able to adapt that to do what you need.
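As a side note, invoke_map() has since been deprecated in purrr (as of 1.0.0). A rough sketch of one way to write each function name only once is to keep the functions in a named list and let imap() pass both the function and its name. This is an assumption-laden illustration rather than the approach from the answer above: cumsum and cummax are base R stand-ins for roll_sdr and roll_varr so the example runs without RcppRoll, and apply_one is a hypothetical helper.

```r
library(purrr)

set.seed(1)
temp <- data.frame(
  ID = letters,
  var1 = sample(1:100, 26, replace = TRUE),
  var2 = sample(100:200, 26, replace = TRUE)
)

# Named list of functions: the name doubles as the column label.
# cumsum/cummax are stand-ins for roll_sdr/roll_varr (assumption).
funs <- list(cumsum = cumsum, cummax = cummax)

# Hypothetical helper: apply f to the numeric columns and tag them with label
apply_one <- function(f, label) {
  out <- as.data.frame(lapply(temp[, 2:3], f))
  names(out) <- paste0(names(out), ".", label)
  out
}

# imap() calls apply_one(function, name) for every element of funs
pieces <- imap(funs, apply_one)

# Column-bind the pieces onto the original data; unname() keeps cbind()
# from prefixing the column names with the list names
result <- Reduce(cbind, unname(pieces), temp)

names(result)
```

The rolling functions would be swapped in by replacing the list with something like list(roll_sdr = roll_sdr, roll_varr = roll_varr) and passing their extra arguments inside apply_one.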

Why do I get different results using SE or NSE dplyr functions

Hi, I get different results from dplyr functions when I use standard evaluation through the lazyeval package.
Here is how to reproduce something close to my real data, with 250k rows and about 230k groups. I would like to group by id1 and id2 and keep the rows with max(datetime) in each group.
library(dplyr)
# random datetime generation function by Dirk Eddelbuettel
# http://stackoverflow.com/questions/14720983/efficiently-generate-a-random-sample-of-times-and-dates-between-two-dates
rand.datetime <- function(N, st = "2012/01/01", et = "2015/08/13") {
st <- as.POSIXct(as.Date(st))
et <- as.POSIXct(as.Date(et))
dt <- as.numeric(difftime(et, st, units = "secs"))
ev <- sort(runif(N, 0, dt))
rt <- st + ev
}
set.seed(42)
# Create 230000 pairs of ids
ids <- data_frame(id1 = stringi::stri_rand_strings(23e4, 9, pattern = "[0-9]"),
id2 = stringi::stri_rand_strings(23e4, 9, pattern = "[0-9]"))
# Randomly repeat rows from ids[1:2000, ] to create groups
ids <- rbind(ids, ids[sample(1:2000, 20000, replace = TRUE), ])
datas <- mutate(ids, datetime = rand.datetime(25e4))
When I use the NSE approach, I get 230000 rows:
df1 <-
datas %>%
group_by(id1, id2) %>%
filter(datetime == max(datetime))
nrow(df1) #230000
But when I use the SE approach, I get only 229977 rows:
ids <- c("id1", "id2")
filterVar <- "datetime"
filterFun <- "max"
df2 <-
datas %>%
group_by_(ids) %>%
filter_(.dots = lazyeval::interp(~var == fun(var),
var = as.name(filterVar),
fun = as.name(filterFun)))
nrow(df2) #229977
My two pieces of code are equivalent, right?
Why do I get different results? Thanks.
You'll need to specify the .dots argument in group_by_ when giving a vector of column names.
df2 <- datas %>%
group_by_(.dots = ids) %>%
filter_(.dots = lazyeval::interp(~var == fun(var),
var = as.name(filterVar),
fun = as.name(filterFun)))
nrow(df2)
[1] 230000
It looks like group_by_ might take the first column name from the vector as the only grouping variable when you don't specify the .dots argument. You can check this by grouping on id1 only.
df1 <- datas %>%
group_by(id1) %>%
filter(datetime == max(datetime))
nrow(df1)
[1] 229977
(If you group just on id2 the number of rows is 229976).
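The underscore verbs (group_by_, filter_) and lazyeval have been deprecated in dplyr since 0.7.0. A minimal sketch of the same string-driven filter with current tools, assuming dplyr 1.0.0 or later and using a tiny made-up data set in place of the 250k-row one:

```r
library(dplyr)

# Small stand-in for the real data: three (id1, id2) groups
datas <- data.frame(
  id1 = c("a", "a", "a", "b"),
  id2 = c("x", "x", "y", "x"),
  datetime = as.POSIXct(
    c("2015-01-01", "2015-02-01", "2015-03-01", "2015-04-01"),
    tz = "UTC"
  )
)

ids <- c("id1", "id2")
filterVar <- "datetime"
filterFun <- "max"

# Turn the function name into the function itself
fun <- match.fun(filterFun)

latest <- datas %>%
  group_by(across(all_of(ids))) %>%                        # group by every name in ids
  filter(.data[[filterVar]] == fun(.data[[filterVar]])) %>% # keep the max row per group
  ungroup()

nrow(latest)
```

Here all_of(ids) uses every name in the vector for grouping, which avoids the situation described above where only the first name ended up being used.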

How to get the name of a data.frame within a list?

How can I get a data frame's name from a list? Sure, get() gets the object itself, but I want to have its name for use within another function. Here's the use case, in case you would rather suggest a workaround:
lapply(somelistOfDataframes, function(X) {
ddply(X, .(idx, bynameofX), summarise, checkSum = sum(value))
})
There is a column in each data frame that goes by the same name as the data frame within the list. How can I get this name bynameofX? names(X) would return the whole vector.
EDIT: Here's a reproducible example:
df1 <- data.frame(value = rnorm(100), cat = c(rep(1,50),
rep(2,50)), idx = rep(letters[1:4],25))
df2 <- data.frame(value = rnorm(100,8), cat2 = c(rep(1,50),
rep(2,50)), idx = rep(letters[1:4],25))
mylist <- list(cat = df1, cat2 = df2)
lapply(mylist, head, 5)
I'd use the names of the list in this fashion:
dat1 = data.frame()
dat2 = data.frame()
l = list(dat1 = dat1, dat2 = dat2)
> str(l)
List of 2
$ dat1:'data.frame': 0 obs. of 0 variables
$ dat2:'data.frame': 0 obs. of 0 variables
and then use lapply + ddply like:
lapply(names(l), function(x) {
ddply(l[[x]], c("idx", x), summarise,checkSum = sum(value))
})
This remains untested without a reproducible example, but it should point you in the right direction.
EDIT (ran2): Here's the code using the reproducible example.
l <- lapply(names(mylist), function(x) {
ddply(mylist[[x]], c("idx", x), summarise,checkSum = sum(value))
})
names(l) <- names(mylist); l
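The same pattern can be sketched without plyr by iterating over the list and its names in parallel with Map(). This is an illustration only, using base R aggregate() in place of ddply(); the name sums is made up here:

```r
set.seed(1)
df1 <- data.frame(value = rnorm(100), cat = rep(1:2, each = 50),
                  idx = rep(letters[1:4], 25))
df2 <- data.frame(value = rnorm(100, 8), cat2 = rep(1:2, each = 50),
                  idx = rep(letters[1:4], 25))
mylist <- list(cat = df1, cat2 = df2)

# Map() hands each data frame (d) and its list name (nm) to the function,
# so nm can be used to pick the column named after the list element
sums <- Map(function(d, nm) {
  aggregate(d["value"], by = d[c("idx", nm)], FUN = sum)
}, mylist, names(mylist))

sums$cat2
```

Map() keeps the names of mylist on the result, so no separate names(l) <- names(mylist) step is needed.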
Here is the dplyr equivalent
library(dplyr)
catalog =
data_frame(
data = someListOfDataframes,
cat = names(someListOfDataframes)) %>%
rowwise %>%
mutate(
renamed =
data %>%
rename_(.dots =
cat %>%
as.name %>%
list %>%
setNames("cat")) %>%
list)
catalog$renamed %>%
bind_rows(.id = "number") %>%
group_by(number, idx, cat) %>%
summarize(checkSum = sum(value))
You could first store the names with list_name <- names(list) and then use list_name[1], list_name[2], etc. to get each element's name. (You may also need as.numeric(list_name[x]) if your list names are numbers.)
