Suppose I have a list of data.frames:
list <- list(A=data.frame(x=c(1,2),y=c(3,4)), B=data.frame(x=c(1,2),y=c(7,8)))
I want to combine them into one data.frame like this:
data.frame(x=c(1,2,1,2), y=c(3,4,7,8), group=c("A","A","B","B"))
x y group
1 1 3 A
2 2 4 A
3 1 7 B
4 2 8 B
I can do it this way:
add_group_name <- function(df, group) {
df$group <- group
df
}
Reduce(rbind, mapply(add_group_name, list, names(list), SIMPLIFY=FALSE))
But I want to know if it's possible to get the name inside the lapply loop without the use of names(list), just like this:
add_group_name <- function(df) {
df$group <- ? #How to get the name of df in the list here?
}
Reduce(rbind, lapply(list, add_group_name))
I renamed list to listy to remove the clash with the base function. This is a variation on Señor O's answer in essence:
do.call(rbind, Map("[<-", listy, TRUE, "group", names(listy) ) )
# x y group
#A.1 1 3 A
#A.2 2 4 A
#B.1 1 7 B
#B.2 2 8 B
This is also very similar to a previous question and answer here: r function/loop to add column and value to multiple dataframes
The inner Map part gives this result:
Map("[<-", listy, TRUE, "group", names(listy) )
#$A
# x y group
#1 1 3 A
#2 2 4 A
#
#$B
# x y group
#1 1 7 B
#2 2 8 B
...which in long form, for explanation's sake, could be written like:
Map(function(data, nms) {data[TRUE,"group"] <- nms; data;}, listy, names(listy) )
As @flodel suggests, you could also use R's built-in transform function for updating data frames, which may be simpler again:
do.call(rbind, Map(transform, listy, group = names(listy)) )
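For readers who want to run the idea end to end, here is a minimal base R sketch of the long form above. The helper name tag_with_name is hypothetical, and resetting the row names at the end is an optional choice, not part of the original answer:

```r
# Example data from the question, stored as listy to avoid masking base::list
listy <- list(A = data.frame(x = c(1, 2), y = c(3, 4)),
              B = data.frame(x = c(1, 2), y = c(7, 8)))

# Hypothetical helper: add the list element's name as a 'group' column
tag_with_name <- function(df, nm) {
  df$group <- nm
  df
}

# Map() walks the data frames and their names in parallel
tagged <- Map(tag_with_name, listy, names(listy))

combined <- do.call(rbind, tagged)
rownames(combined) <- NULL  # drop the "A.1", "A.2", ... row names
combined
```

The function never needs to discover its own name; the name is simply passed in as a second argument.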
I think a much easier approach is:
> do.call(rbind, lapply(names(list), function(x) data.frame(list[[x]], group = x)))
x y group
1 1 3 A
2 2 4 A
3 1 7 B
4 2 8 B
Using plyr:
library(plyr)
ldply(ll)  # ll is the named list of data.frames
.id x y
1 A 1 3
2 A 2 4
3 B 1 7
4 B 2 8
Or in 2 steps :
xx <- do.call(rbind,ll)
xx$group <- sub('([A-Z]).*','\\1',rownames(xx))
xx
x y group
A.1 1 3 A
A.2 2 4 A
B.1 1 7 B
B.2 2 8 B
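The regular expression above assumes that every list name is a single capital letter. As a sketch that does not depend on the row-name format, the names can instead be repeated once per row of each element. Here ll is assumed to be the named list from the question, and rows_per_df is a name introduced only for illustration:

```r
# ll is assumed to be the named list of data frames from the question
ll <- list(A = data.frame(x = c(1, 2), y = c(3, 4)),
           B = data.frame(x = c(1, 2), y = c(7, 8)))

xx <- do.call(rbind, ll)

# Number of rows contributed by each list element
rows_per_df <- vapply(ll, nrow, integer(1))

# Repeat each name as many times as its data frame has rows
xx$group <- rep(names(ll), times = rows_per_df)
xx
```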
To start: I've seen this post and no, tidyr's unnest doesn't work here. I am doing an lapply where the returning function returns a list with named entries (see example func at the bottom for clarity):
ls <- lapply(x, func)
Now if I look at ls, it is a list of lists, and in the RStudio data viewer it appears as having Name, Type, and Value columns.
Now, if I use
df <- bind_rows(ls)
I get exactly what I want, except I then need to bind the dataframe containing x to df. This is the problem, because for each x, func will return a variable number of rows, which means I need to run an equivalent of bind_rows after I have already attached ls to my dataframe.
An example is as below:
func <- function(x){
res <- list()
res$name <- 1:x
res$val <- 1:x
return(res)
}
df <- data.frame(nums = 1:3, letters = c("A", "B", "C"))
ls <- lapply(df$nums, func)
bind_rows(ls) gives:
name val
<int> <int>
1 1 1
2 1 1
3 2 2
4 1 1
5 2 2
6 3 3
and the desired output is:
name val nums letters
<int> <int> <dbl> <chr>
1 1 1 1 A
2 1 1 2 B
3 2 2 2 B
4 1 1 3 C
5 2 2 3 C
6 3 3 3 C
Note that func here creates n rows given x = n. This is not the case for my actual function. func(n) can produce any positive number of rows.
Maybe you're looking for something more "canned", but you could write a function that would produce the desired output like this:
out <- function(data, varname){
l <- lapply(data[[varname]], func)
l <- lapply(seq_along(l), function(x) do.call(data.frame, c(l[[x]], zz_obs=x)))
l <- do.call(rbind, l)
data$zz_obs <- seq_len(nrow(data))
if(!all(data$zz_obs %in% l$zz_obs)) warning("Not all rows of data in output\n")
data <- dplyr::full_join(l, data, by="zz_obs")
data[,-which(names(data) == "zz_obs")]
}
out(df, "nums")
# name val nums letters
# 1 1 1 1 A
# 2 1 1 2 B
# 3 2 2 2 B
# 4 1 1 3 C
# 5 2 2 3 C
# 6 3 3 3 C
You can try mapply which is similar to lapply, but allows multiple vectors or lists to be passed to iterate over their values:
library(dplyr)
func <- function(x, y){
res <- list()
res$name <- 1:x
res$val <- 1:x
res$let <- rep(y, x)
return(res)
}
df <- data.frame(nums = 1:3, letters = c("A", "B", "C"))
ls <- mapply(
func,
x = df$nums,
y =df$letters,
SIMPLIFY = FALSE
)
bind_rows(ls)
# A tibble: 6 x 3
# name val let
# <int> <int> <chr>
# 1 1 1 A
# 2 1 1 B
# 3 2 2 B
# 4 1 1 C
# 5 2 2 C
# 6 3 3 C
In the interim, the function I will be using to do this is:
merge_and_flatten <- function(x, y){
for (i in 1:nrow(x)){
y[[i]][names(x)] <- lapply(x[i, ], rep, times = length(y[[i]][[1]]))
}
return(bind_rows(y))
}
This is the cleanest solution I could come up with. Here, x serves as my df, and y serves as ls. It works by reducing the problem to bind_rows: it simply adds elements to ls which contain the columns in x. I absolutely want a cleaner solution, but this works for anyone who needs it.
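As a possible alternative, here is a base R sketch that avoids the explicit loop by repeating each row of df once per row that func returns for it. The names res_list, pieces, n_rows and result are introduced only for this illustration, and func is the toy function from the question:

```r
# Toy function from the question: returns a named list with x entries each
func <- function(x) {
  list(name = seq_len(x), val = seq_len(x))
}

df <- data.frame(nums = 1:3, letters = c("A", "B", "C"))
res_list <- lapply(df$nums, func)

# Turn each returned list into a data frame and count its rows
pieces <- lapply(res_list, as.data.frame)
n_rows <- vapply(pieces, nrow, integer(1))

# Repeat row i of df n_rows[i] times, then bind it next to the stacked results
expanded <- df[rep(seq_len(nrow(df)), times = n_rows), , drop = FALSE]
result <- cbind(do.call(rbind, pieces), expanded)
rownames(result) <- NULL
result
```

Because the repeat counts come from the results themselves, this does not assume that func(n) returns exactly n rows.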
I need a way to identify the minimum value in a particular column that is present in all dataframes in a list of dataframes, and replace it with some non-numeric character. For example:
df1 <- data.frame(x=c("a","b","c"), y=c(2,4,6))
df2 <- data.frame(x=c("a","b","c"), y=c(10,20,30))
myList <- list(df1, df2)
[[1]]
x y
1 a 2
2 b 4
3 c 6
[[2]]
x y
1 a 10
2 b 20
3 c 30
should become
[[1]]
x y
1 a *
2 b 4
3 c 6
[[2]]
x y
1 a *
2 b 20
3 c 30
What's the best way? It would be great if someone knew both a base R solution and one using an external package (purrr).
Thanks!
Here is a base R option
lapply(myList, function(df) transform(df, y = replace(y, which.min(y), "*")))
#[[1]]
# x y
#1 a *
#2 b 4
#3 c 6
#
#[[2]]
# x y
#1 a *
#2 b 20
#3 c 30
Or the same in the tidyverse
library(tidyverse)
map(myList, ~.x %>% mutate(y = replace(y, which.min(y), "*")))
for(i in 1:length(myList)){
currMin = min(myList[[i]]$y)
myList[[i]]$y[myList[[i]]$y==currMin] <- '*'
}
Please note that assigning '*' will convert the column type to character.
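One difference between the answers is worth spelling out: which.min() returns only the position of the first minimum, while comparing against min() marks every tie. A small sketch with a made-up vector containing a tie (the vector is an assumption for illustration, not data from the question):

```r
# Made-up values with a tied minimum
y <- c(2, 2, 6)

# which.min() picks only the first minimum
first_only <- replace(y, which.min(y), "*")

# A logical comparison picks every element equal to the minimum
all_ties <- replace(y, y == min(y), "*")

first_only         # "*" "2" "6"
all_ties           # "*" "*" "6"
is.numeric(all_ties)  # FALSE: inserting "*" coerces the vector to character
```

If the column needs to stay numeric, replacing with NA instead of "*" is one option, at the cost of losing the visible marker.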
I am trying to write a multi-merge alternative to merge which can merge together more than two datasets on a single key.
The code I have is like this:
multimerge <- function(..., by, all=T) {
value <- list(...)
Reduce(function(x,y)merge(x,y,by=by, all=all), value)
}
But the thing I want to multi-merge is a list. Is it possible to pass a list argument as the ... in a function?
For instance:
List <- list(
data.frame('x'=c('a','b','c'), 'y'=1),
data.frame('x'=c('a','b','c'), 'z'=2)
)
would take
multimerge(List, by='x')
as an argument and give:
x y z
a 1 2
b 1 2
c 1 2
as output. But I do not want to write another version of multimerge.
purrr has a powerful function called flatten that would be perfect for this problem:
library(purrr)
multimerge <- function(..., by, all=T) {
value = flatten(list(...))
Reduce(function(x, y) merge(x, y, by=by, all=all), value)
}
No matter what is being fed into ..., flatten turns list(...) into a list of dataframes for Reduce. With this functionality, you can feed either a list of dataframes, several individual dataframes, both, or even several lists of dataframes.
You can also imitate the behavior of flatten by doing something like this in Base R:
multimerge <- function(..., by, all=T) {
value = list(...)
df_index = which(sapply(value, inherits, "data.frame"))
list_index = which(sapply(value, inherits, "list"))
value = c(value[df_index], unlist(value[list_index], recursive = FALSE))
Reduce(function(x, y) merge(x, y, by=by, all=all), value)
}
This applies unlist only to elements that are "lists" and keeps dataframes untouched. Note that I used inherits instead of is.list, because dataframes are technically also lists!
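The distinction between is.list and inherits can be checked directly. A minimal sketch with a throwaway data frame (df_demo is a name made up for this illustration):

```r
df_demo <- data.frame(x = 1)

is.list(df_demo)                 # TRUE: a data frame is stored as a list of columns
inherits(df_demo, "list")        # FALSE: its class is "data.frame", not "list"
inherits(list(df_demo), "list")  # TRUE: a plain list wrapping the data frame
```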
Result:
> multimerge(List, by='x')
x y z
1 a 1 2
2 b 1 2
3 c 1 2
> multimerge(List[[1]], List[[2]], by='x')
x y z
1 a 1 2
2 b 1 2
3 c 1 2
> multimerge(List, List[[1]], List[[2]], by='x')
x y.x z.x y.y z.y
1 a 1 2 1 2
2 b 1 2 1 2
3 c 1 2 1 2
> multimerge(List, List, by='x')
x y.x z.x y.y z.y
1 a 1 2 1 2
2 b 1 2 1 2
3 c 1 2 1 2
Additional Notes:
From the documentation of ?flatten:
These functions remove a level hierarchy from a list. They are similar to unlist(), only ever remove a single layer of hierarchy, and are type-stable so you always know what the type of the output is.
The key word is "type-stability", meaning it always returns the same type of data structure.
> flatten(list(List, List[[1]], List[[2]]))
[[1]]
x y
1 a 1
2 b 1
3 c 1
[[2]]
x z
1 a 2
2 b 2
3 c 2
[[3]]
x y
1 a 1
2 b 1
3 c 1
[[4]]
x z
1 a 2
2 b 2
3 c 2
> unlist(list(List, List[[1]], List[[2]]), recursive = FALSE)
[[1]]
x y
1 a 1
2 b 1
3 c 1
[[2]]
x z
1 a 2
2 b 2
3 c 2
$x
[1] a b c
Levels: a b c
$y
[1] 1 1 1
$x
[1] a b c
Levels: a b c
$z
[1] 2 2 2
The main difference between flatten and unlist + recursive = FALSE is that flatten "unlists" only if the output matches the data structure of the rest, whereas unlist + recursive = FALSE always flattens one level, so in my Base R example, I needed an extra step to check whether the element is a list or a dataframe.
So, the problem is that when you pass a list into multimerge the list gets put into another list, which then gets collapsed back into the original list. You could just do a check for superfluous length 1 lists, and strip off that level of lists:
multimerge <- function(..., by, all=T) {
value <- list(...)
if (length(value) == 1) value <- value[[1]]
Reduce(function(x,y)merge(x,y,by=by, all=all), value)
}
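If the goal is to leave multimerge untouched, another option is to let do.call() spread the list over ... at the call site. A minimal sketch using the original definition from the question (merged is a name introduced only for illustration):

```r
# Original definition from the question, unchanged apart from spelling out TRUE
multimerge <- function(..., by, all = TRUE) {
  value <- list(...)
  Reduce(function(x, y) merge(x, y, by = by, all = all), value)
}

List <- list(
  data.frame(x = c("a", "b", "c"), y = 1),
  data.frame(x = c("a", "b", "c"), z = 2)
)

# do.call() passes each list element as a separate argument;
# named arguments such as 'by' are appended to the argument list
merged <- do.call(multimerge, c(List, list(by = "x")))
merged
```

This keeps the function's interface as it was and moves the unpacking to the caller.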
I would like to create a numeric indicator for a matrix such that for each unique element in one variable, it creates a sequence of the length based on the element in another variable. For example:
frame<- data.frame(x = c("a", "a", "a", "b", "b"), y = c(3,3,3,2,2))
frame
x y
1 a 3
2 a 3
3 a 3
4 b 2
5 b 2
The indicator, z, should look like this:
x y z
1 a 3 1
2 a 3 2
3 a 3 3
4 b 2 1
5 b 2 2
Any and all help greatly appreciated. Thanks.
No ave?
frame$z <- with(frame, ave(y,x,FUN=seq_along) )
frame
# x y z
#1 a 3 1
#2 a 3 2
#3 a 3 3
#4 b 2 1
#5 b 2 2
A data.table version could be something like below (thanks to @mnel):
#library(data.table)
#frame <- as.data.table(frame)
frame[,z := seq_len(.N), by=x]
My original thought was to use:
frame[,z := .SD[,.I], by=x]
where .SD refers to each subset of the data.table split by x. .I returns the row numbers for an entire data.table. So, .SD[,.I] returns the row numbers within each group. Although, as @mnel points out, this is inefficient compared to the other method as the entire .SD needs to be loaded into memory for each group to run this calculation.
Another approach:
frame$z <- unlist(lapply(rle(as.character(frame[, "x"]))$lengths, seq_len))
library(dplyr)
frame %>%
group_by(x) %>%
mutate(z = seq_along(y))
You can split the data.frame on x, and generate a new id column based on that:
> frame$z <- unlist(lapply(split(frame, frame$x), function(x) 1:nrow(x)))
> frame
x y z
1 a 3 1
2 a 3 2
3 a 3 3
4 b 2 1
5 b 2 2
Or even more simply using data.table:
library(data.table)
frame <- data.table(frame)[,z:=1:nrow(.SD),by=x]
Try this, where x is the column to group by and y is any numeric column. If there are no numeric columns, use seq_along(x), say, in place of y:
transform(frame, z = ave(y, x, FUN = seq_along))
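To illustrate the seq_along(x) variant, here is a small sketch on a made-up frame whose groups are not contiguous (frame2 is an assumption for illustration, not the data from the question). ave() writes each group's sequence back into the original row positions, so the input does not need to be sorted by x:

```r
# Made-up data: groups "a" and "b" are interleaved rather than sorted
frame2 <- data.frame(x = c("a", "b", "a", "b", "a"))

# No numeric column is available, so use the row positions as the first argument
frame2$z <- ave(seq_along(frame2$x), frame2$x, FUN = seq_along)
frame2
#   x z
# 1 a 1
# 2 b 1
# 3 a 2
# 4 b 2
# 5 a 3
```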
I would like to interweave two data.frames in R. For example:
a = data.frame(x=1:5, y=5:1)
b = data.frame(x=2:6, y=4:0)
I would like the result to look like:
> x y
1 5
2 4
2 4
3 3
3 3
...
obtained by rbinding a[1, ] with b[1, ], a[2, ] with b[2, ], etc.
What is the cleanest way to do this? Right now my solution involves spitting everything out to a list and merging. This is pretty ugly:
lst = lapply(seq_len(nrow(a)), function(i) rbind(a[i, ], b[i, ]))
res = do.call(rbind, lst)
There is, of course, the interleave function in the "gdata" package:
library(gdata)
interleave(a, b)
# x y
# 1 1 5
# 6 2 4
# 2 2 4
# 7 3 3
# 3 3 3
# 8 4 2
# 4 4 2
# 9 5 1
# 5 5 1
# 10 6 0
You can do this by giving a and b an index, rbinding them, and sorting by the index.
a = data.frame(x=1:5, y=5:1)
b = data.frame(x=2:6, y=4:0)
df <- rbind(data.frame(a, index = 1:nrow(a)), data.frame(b, index = 1:nrow(b)))
df <- df[order(df$index), c("x", "y")]
This is how I'd approach:
dat <- do.call(rbind.data.frame, list(a, b))
dat[order(dat$x), ]
do.call was unnecessary in the first step but makes the solution more extendable.
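Sorting by x gives the interleaved result here because the values in a and b happen to alternate. As a sketch that depends only on row position, not on the values, the combined rows can be ordered by their index within each source frame (stacked, position and res are names introduced for illustration):

```r
a <- data.frame(x = 1:5, y = 5:1)
b <- data.frame(x = 2:6, y = 4:0)

stacked <- rbind(a, b)

# Position of each row within its own source data frame: 1..5, then 1..5
position <- c(seq_len(nrow(a)), seq_len(nrow(b)))

# order() keeps ties in their original order, so row i of a precedes row i of b
res <- stacked[order(position), ]
rownames(res) <- NULL
res
```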
Perhaps this is cheating a bit, but the (non-exported) function interleave from ggplot2 is something I've stolen for my own uses before:
as.data.frame(mapply(FUN=ggplot2:::interleave,a,b))