I am trying to write a multi-merge alternative to merge that can merge more than two datasets on a single key.
The code I have is like this:
multimerge <- function(..., by, all=T) {
value <- list(...)
Reduce(function(x,y)merge(x,y,by=by, all=all), value)
}
But the thing I want to multi-merge is a list. Is it possible to pass a list argument as the ... in a function?
For instance:
List <- list(
data.frame('x'=c('a','b','c'), 'y'=1),
data.frame('x'=c('a','b','c'), 'z'=2)
)
should be accepted by
multimerge(List, by='x')
as an argument and give:
x y z
a 1 2
b 1 2
c 1 2
as output. But I do not want to write another version of multimerge.
purrr has a powerful function called flatten that would be perfect for this problem:
library(purrr)
multimerge <- function(..., by, all=T) {
value = flatten(list(...))
Reduce(function(x, y) merge(x, y, by=by, all=all), value)
}
No matter what is being fed into ..., flatten turns list(...) into a list of dataframes for Reduce. With this functionality, you can feed either a list of dataframes, several individual dataframes, both, or even several lists of dataframes.
You can also imitate the behavior of flatten by doing something like this in Base R:
multimerge <- function(..., by, all=T) {
value = list(...)
df_index = which(sapply(value, inherits, "data.frame"))
list_index = which(sapply(value, inherits, "list"))
value = c(value[df_index], unlist(value[list_index], recursive = FALSE))
Reduce(function(x, y) merge(x, y, by=by, all=all), value)
}
This applies unlist only to elements that are "lists" and keeps dataframes untouched. Note that I used inherits instead of is.list, because dataframes are technically also lists!
Result:
> multimerge(List, by='x')
x y z
1 a 1 2
2 b 1 2
3 c 1 2
> multimerge(List[[1]], List[[2]], by='x')
x y z
1 a 1 2
2 b 1 2
3 c 1 2
> multimerge(List, List[[1]], List[[2]], by='x')
x y.x z.x y.y z.y
1 a 1 2 1 2
2 b 1 2 1 2
3 c 1 2 1 2
> multimerge(List, List, by='x')
x y.x z.x y.y z.y
1 a 1 2 1 2
2 b 1 2 1 2
3 c 1 2 1 2
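The inherits versus is.list distinction mentioned above can be checked directly. A minimal sketch; the object names are made up for illustration:

```r
# A data frame is stored as a list, but its class is "data.frame", not "list"
df_example <- data.frame(x = c("a", "b", "c"), y = 1)
plain_list <- list(df_example)

print(is.list(df_example))           # TRUE: data frames are lists underneath
print(inherits(df_example, "list"))  # FALSE: the class vector has no "list"
print(inherits(plain_list, "list"))  # TRUE: a plain list has implicit class "list"
```

This is why is.list cannot separate the two cases, while inherits can.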
Additional Notes:
From the documentation of ?flatten:
These functions remove a level hierarchy from a list. They are similar to unlist(), only ever remove a single layer of hierarchy, and are type-stable so you always know what the type of the output is.
The key word is "type-stability", meaning it always returns the same type of data structure.
> flatten(list(List, List[[1]], List[[2]]))
[[1]]
x y
1 a 1
2 b 1
3 c 1
[[2]]
x z
1 a 2
2 b 2
3 c 2
[[3]]
x y
1 a 1
2 b 1
3 c 1
[[4]]
x z
1 a 2
2 b 2
3 c 2
> unlist(list(List, List[[1]], List[[2]]), recursive = FALSE)
[[1]]
x y
1 a 1
2 b 1
3 c 1
[[2]]
x z
1 a 2
2 b 2
3 c 2
$x
[1] a b c
Levels: a b c
$y
[1] 1 1 1
$x
[1] a b c
Levels: a b c
$z
[1] 2 2 2
The main difference between flatten and unlist + recursive = FALSE is that flatten "unlists" only if the output matches the data structure of the rest, whereas unlist + recursive = FALSE always flattens one level, so in my Base R example, I needed an extra step to check whether the element is a list or a dataframe.
So, the problem is that when you pass a list into multimerge the list gets put into another list, which then gets collapsed back into the original list. You could just do a check for superfluous length 1 lists, and strip off that level of lists:
multimerge <- function(..., by, all=T) {
value <- list(...)
if (length(value) == 1) value <- value[[1]]
Reduce(function(x,y)merge(x,y,by=by, all=all), value)
}
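If multimerge itself should stay untouched, another option is to spread the list out at the call site with do.call. A minimal sketch, assuming the original definition from the question:

```r
# multimerge as defined in the question (only spacing changed)
multimerge <- function(..., by, all = TRUE) {
  value <- list(...)
  Reduce(function(x, y) merge(x, y, by = by, all = all), value)
}

List <- list(
  data.frame(x = c("a", "b", "c"), y = 1),
  data.frame(x = c("a", "b", "c"), z = 2)
)

# do.call() passes each list element as a separate argument;
# named arguments such as `by` are appended to the argument list.
merged <- do.call(multimerge, c(List, list(by = "x")))
print(merged)
```

The trade-off is that the caller has to know whether it holds a list or separate data frames, whereas the versions above handle both inside the function.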
To start: I've seen this post and no, tidyr's unnest doesn't work here. I am doing an lapply where the returning function returns a list with named entries (see example func at the bottom for clarity):
ls <- lapply(x, func)
Now if I look at ls, it is a list of lists, and in the R studio data viewer it appears as having Name, Type, and Value columns.
Now, if I use
df <- bind_rows(ls)
I get exactly what I want, except I then need to bind the dataframe containing x to df. This is the problem, because for each x, func will return a variable number of rows, which means I need to run an equivalent of bind_rows after I have already attached ls to my dataframe.
An example is as below:
func <- function(x){
res <- list()
res$name <- 1:x
res$val <- 1:x
return(res)
}
df <- data.frame(nums = c(1:3), letters = c("A", "B", "C"))
ls <- lapply(df$nums, func)
bind_rows(ls) gives:
name val
<int> <int>
1 1 1
2 1 1
3 2 2
4 1 1
5 2 2
6 3 3
and the desired output is:
name val nums letters
<int> <int> <dbl> <chr>
1 1 1 1 A
2 1 1 2 B
3 2 2 2 B
4 1 1 3 C
5 2 2 3 C
6 3 3 3 C
Note that func here creates n rows given x = n. This is not the case for my actual function. func(n) can produce any positive number of rows.
Maybe you're looking for something more "canned", but you could write a function that would produce the desired output like this:
out <- function(data, varname){
l <- lapply(data[[varname]], func)
l <- lapply(seq_along(l), function(x) do.call(data.frame, c(l[[x]], zz_obs=x)))
l <- do.call(rbind, l)
data$zz_obs <- 1:nrow(data)
if(!all(data$zz_obs %in% l$zz_obs)) warning("Not all rows of data in output\n")
data <- dplyr::full_join(l, data, by="zz_obs")
data[,-which(names(data) == "zz_obs")]
}
out(df, "nums")
# name val nums letters
# 1 1 1 1 A
# 2 1 1 2 B
# 3 2 2 2 B
# 4 1 1 3 C
# 5 2 2 3 C
# 6 3 3 3 C
You can try mapply which is similar to lapply, but allows multiple vectors or lists to be passed to iterate over their values:
library(dplyr)
func <- function(x, y){
res <- list()
res$name <- 1:x
res$val <- 1:x
res$let <- rep(y, x)
return(res)
}
df <- data.frame(nums = c(1:3), letters = c("A", "B", "C"))
ls <- mapply(
func,
x = df$nums,
y = df$letters,
SIMPLIFY = FALSE
)
bind_rows(ls)
# A tibble: 6 x 3
# name val let
# <int> <int> <chr>
# 1 1 1 A
# 2 1 1 B
# 3 2 2 B
# 4 1 1 C
# 5 2 2 C
# 6 3 3 C
In the interim, the function I will be using to do this is:
merge_and_flatten <- function(x, y){
for (i in 1:nrow(x)){
y[[i]][names(x)] <- lapply(x[i, ], rep, times = length(y[[i]][[1]]))
}
return(bind_rows(y))
}
This is the cleanest solution I could come up with. Here, x serves as my df, and y serves as ls. It works by reducing the problem to bind_rows: it simply adds elements to ls which contain the columns in x. I absolutely want a cleaner solution, but this works for anyone who needs it.
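One possibly simpler variant of the same idea, offered as a sketch rather than a definitive answer: count how many rows each call produced, then repeat each row of df that many times so it lines up with the stacked results. It uses only base R; the names results, n_rows, stacked and expanded are made up here, and it assumes every element of one result has the same length.

```r
# Example func from the question: returns a list with named entries
func <- function(x) list(name = 1:x, val = 1:x)

df <- data.frame(nums = 1:3, letters = c("A", "B", "C"))
results <- lapply(df$nums, func)

# Rows produced per call, taken from the length of the first element
n_rows <- vapply(results, function(res) length(res[[1]]), integer(1))

# Stack the results, then repeat each row of df to match them
stacked <- do.call(rbind, lapply(results, as.data.frame))
expanded <- df[rep(seq_len(nrow(df)), times = n_rows), , drop = FALSE]

result <- cbind(stacked, expanded)
rownames(result) <- NULL
print(result)
```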
I am writing a script that loads RData files containing the results of earlier experiments and parses data frames saved in them. I've noticed that the names of variables are not consistent: for instance, sometimes symbol is called gene_name or gene_symbol. The order of variables also differs between the data frames, so I can't just rename them all with colnames(df) <- c('a', 'b', ...)
I'm looking for a way to rename variables based on their name that won't give an error if that variable isn't found. The below is what I want to do, but (ideally) without needing dozens of conditional statements.
if ('gene_name' %in% colnames(df)) {
df <- df %>% dplyr::rename('symbol' = gene_name)
}
In the below example, I'd like to find an elegant way to rename the variable b to D that I can use safely on data frames that lack a variable b
x <- data.frame('a' = c(1,2,3), 'b' = c(4,5,6))
y <- data.frame('a' = c(1,2,3), 'c' = c(4,5,6))
dfs <- list(x,y)
dfs.fixed <- lapply(dfs, function(x) ?????)
Desired result:
dfs.fixed
[[1]]
a D
1 1 4
2 2 5
3 3 6
[[2]]
a c
1 1 4
2 2 5
3 3 6
Try this approach:
STEP 1
A function substituting a list of colnames with another string (both info parameterized):
colnames_rep<-function(df,to_find,to_sub)
{
colnames(df)[which(colnames(df) %in% to_find)]<-to_sub
return(df)
}
STEP 2
Use lapply to apply the function over each data.frame:
lapply(dfs,colnames_rep,to_find=c("b"),to_sub="D")
[[1]]
a D
1 1 4
2 2 5
3 3 6
[[2]]
a c
1 1 4
2 2 5
3 3 6
Thanks to divibisan for the suggestion
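To cover several old names at once (gene_name, gene_symbol, b, ...), the same idea can be driven by a named lookup vector. A minimal base-R sketch; rename_if_present and lookup are hypothetical names, not part of any package:

```r
# Hypothetical helper: `lookup` maps old names (vector names) to new names (values)
rename_if_present <- function(df, lookup) {
  hit <- names(df) %in% names(lookup)               # columns that have a rule
  names(df)[hit] <- unname(lookup[names(df)[hit]])  # rename only those columns
  df
}

lookup <- c(gene_name = "symbol", gene_symbol = "symbol", b = "D")

x <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))
y <- data.frame(a = c(1, 2, 3), c = c(4, 5, 6))
dfs_fixed <- lapply(list(x, y), rename_if_present, lookup = lookup)
print(lapply(dfs_fixed, names))
```

Columns without a rule are left alone, so data frames lacking b pass through unchanged and no error is raised.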
We can use rename_at with map
library(dplyr)
library(purrr)
map(dfs, ~ .x %>%
rename_at(vars(matches("^b$")), sub, pattern = "^b$", replacement = "D"))
#[[1]]
# a D
#1 1 4
#2 2 5
#3 3 6
#[[2]]
# a c
#1 1 4
#2 2 5
#3 3 6
Here's an approach that is similar in concept to Terru_theTerror's, but extends it by allowing regular expressions. It might be overkill, but ...
First, we define a simple "map" that maps to the desired name (first string in each vector of the list) from any string (remaining strings in each vector). The function that does the matching accepts an argument of fixed=FALSE, in which case the 2nd and remaining strings can be regular expressions, which gives more power and responsibility.
If using fixed=TRUE (the default), then the map might look like this:
colnamemap <- list(
c("symbol", "gene_name", "gene_symbol"),
c("D", "c", "quux"),
c("bbb", "b", "ccc")
)
where "gene_name" and "gene_symbol" will both be changed to "symbol", etc. If you want to use patterns (fixed=FALSE), however, you should be as specific as possible to preclude mis- or multiple-matches (across columns).
colnamemapptn <- list(
c("symbol", "^gene_(name|symbol)$"),
c("D", "^D$", "^c$", "^quux$"),
c("bbb", "^b$", "^ccc$")
)
The function that does the actual remapping:
fixfunc <- function(df, namemap, fixed = TRUE, ignore.case = FALSE) {
compare <- if (fixed) `%in%` else grepl
downcase <- if (ignore.case) tolower else c
newcn <- cn <- colnames(df)
newnames <- sapply(namemap, `[`, 1L)
matches <- sapply(namemap, function(nmap) {
apply(outer(downcase(nmap[-1]), downcase(cn), Vectorize(compare)), 2, any)
}) # dims: 1=cn; 2=map-to
for (j in seq_len(ncol(matches))) {
if (sum(matches[,j]) > 1) {
warning("rule ", sQuote(newnames[j]), " matches multiple columns: ",
paste(sQuote(cn[ matches[,j] ]), collapse=","))
matches[,j] <- FALSE
}
}
for (i in seq_len(nrow(matches))) {
rowmatches <- sum(matches[i,])
if (rowmatches == 1) {
newcn[i] <- newnames[ matches[i,] ]
} else if (rowmatches > 1) {
warning("column ", sQuote(cn[i]), " matches multiple rules: ",
paste(sQuote(newnames[ matches[i,]]), collapse=","))
matches[i,] <- FALSE
}
}
if (any(matches)) colnames(df) <- newcn
df
}
(You might extend it to ensure uniqueness, using make.names and/or make.unique. There's also ignore.case, not really tested here but easily done, I believe.)
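As a small illustration of the uniqueness point, here is what make.unique does to a set of names in which two columns were remapped to the same target. The vector remapped is a made-up example, not output of fixfunc:

```r
# Hypothetical column names after a remap that sent two columns to "symbol"
remapped <- c("symbol", "a", "symbol")

# make.unique() appends a numeric suffix to later duplicates
unique_names <- make.unique(remapped)
print(unique_names)
```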
I'm going to extend your sample data by including one that will match multiple patterns resulting in ambiguity:
x <- data.frame('a' = c(1,2,3), 'b' = c(4,5,6))
y <- data.frame('a' = c(1,2,3), 'c' = c(4,5,6))
z <- data.frame('cc' = 1:3, 'ccc' = 2:4)
dfs <- list(x,y,z)
where the third data.frame has two columns that match my second non-pattern vector (the "D" rule) when its strings are treated as patterns. When there are multiple matches, I think the safer thing to do is warn about it and change none of them.
This is correct, fixed-strings only:
lapply(dfs, fixfunc, colnamemap, fixed=TRUE)
# [[1]]
# a bbb
# 1 1 4
# 2 2 5
# 3 3 6
# [[2]]
# a D
# 1 1 4
# 2 2 5
# 3 3 6
# [[3]]
# cc bbb
# 1 1 2
# 2 2 3
# 3 3 4
This incorrectly uses the strings as patterns, which causes one of them to warn about multiple matches:
lapply(dfs, fixfunc, colnamemap, fixed=FALSE)
# Warning in FUN(X[[i]], ...) :
# rule 'D' matches multiple columns: 'cc','ccc'
# [[1]]
# a bbb
# 1 1 4
# 2 2 5
# 3 3 6
# [[2]]
# a D
# 1 1 4
# 2 2 5
# 3 3 6
# [[3]]
# cc bbb
# 1 1 2
# 2 2 3
# 3 3 4
A better use of fixed=FALSE, with strict patterns instead:
lapply(dfs, fixfunc, colnamemapptn, fixed=FALSE)
# same output as the first call
Suppose I have a list of data.frames:
list <- list(A=data.frame(x=c(1,2),y=c(3,4)), B=data.frame(x=c(1,2),y=c(7,8)))
I want to combine them into one data.frame like this:
data.frame(x=c(1,2,1,2), y=c(3,4,7,8), group=c("A","A","B","B"))
x y group
1 1 3 A
2 2 4 A
3 1 7 B
4 2 8 B
I can do in this way:
add_group_name <- function(df, group) {
df$group <- group
df
}
Reduce(rbind, mapply(add_group_name, list, names(list), SIMPLIFY=FALSE))
But I want to know if it's possible to get the name inside the lapply loop without the use of names(list), just like this:
add_group_name <- function(df) {
df$group <- ? #How to get the name of df in the list here?
}
Reduce(rbind, lapply(list, add_group_name))
I renamed list to listy to remove the clash with the base function. This is a variation on Señor O's answer in essence:
do.call(rbind, Map("[<-", listy, TRUE, "group", names(listy) ) )
# x y group
#A.1 1 3 A
#A.2 2 4 A
#B.1 1 7 B
#B.2 2 8 B
This is also very similar to a previous question and answer here: r function/loop to add column and value to multiple dataframes
The inner Map part gives this result:
Map("[<-", listy, TRUE, "group", names(listy) )
#$A
# x y group
#1 1 3 A
#2 2 4 A
#
#$B
# x y group
#1 1 7 B
#2 2 8 B
...which in long form, for explanation's sake, could be written like:
Map(function(data, nms) {data[TRUE,"group"] <- nms; data;}, listy, names(listy) )
As @flodel suggests, you could also use R's built in transform function for updating dataframes, which may be simpler again:
do.call(rbind, Map(transform, listy, group = names(listy)) )
I think a much easier approach is:
> do.call(rbind, lapply(names(list), function(x) data.frame(list[[x]], group = x)))
x y group
1 1 3 A
2 2 4 A
3 1 7 B
4 2 8 B
Using plyr:
ldply(ll)
.id x y
1 A 1 3
2 A 2 4
3 B 1 7
4 B 2 8
Or in 2 steps :
xx <- do.call(rbind,ll)
xx$group <- sub('([A-Z]).*','\\1',rownames(xx))
xx
x y group
A.1 1 3 A
A.2 2 4 A
B.1 1 7 B
B.2 2 8 B
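Another base-R sketch that avoids parsing the row names: repeat each list name once per row of its data frame. It assumes the named list from the question, called listy here as in the earlier answer:

```r
listy <- list(A = data.frame(x = c(1, 2), y = c(3, 4)),
              B = data.frame(x = c(1, 2), y = c(7, 8)))

# Stack all data frames, then label rows by repeating each name nrow() times
combined <- do.call(rbind, listy)
combined$group <- rep(names(listy), times = vapply(listy, nrow, integer(1)))
rownames(combined) <- NULL
print(combined)
```

Unlike the sub() approach, this does not depend on what the list names look like.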
I'm trying to remove all the NA values from a list of data frames. The only way I have got it to work is by cleaning the data with complete.cases in a for loop. Is there a way of doing this with lapply? I have been trying for a while to no avail. Here is the code that works.
I start with
data_in <- lapply (file_name,read.csv)
Then have:
clean_data <- list()
for (i in seq_along(id)) {
clean_data[[i]] <- data_in[[i]][complete.cases(data_in[[i]]), ]
}
But what I tried to get to work was using lapply all the way like this.
comp <- lapply(data_in, complete.cases)
clean_data <- lapply(data_in, data_in[[id]][comp,])
Which returns this error: "Error in `[.default`(xj, i) : invalid subscript type 'list'"
What I'd like to know is some alternatives or if I was going about this right. And why didn't the last example work?
Thank you so much for your time. Have a nice day.
I'm not sure what you expected with
clean_data <- lapply(data_in, data_in[[id]][comp,])
The second parameter to lapply should be a proper function to which each member of the data_in list will be passed one at a time. Your expression data_in[[id]][comp,] is not a function. I'm not sure where you expected id to come from, but lapply does not create magic variables for you like that. Also, at this point comp is now a list itself of indices. You are making no attempt to iterate over this list in sync with your data_in list. If you wanted to do it in two separate steps, a more appropriate approach would be
comp <- lapply(data_in, complete.cases)
clean_data <- Map(function(d,c) {d[c,]}, data_in, comp)
Here we use Map to iterate over the data_in and comp lists simultaneously. They each get passed in to the function as a parameter and we can do the proper extraction that way. Otherwise, if we wanted to do it in one step, we could do
clean_data <- lapply(data_in, function(x) x[complete.cases(x),])
welcome to SO, please provide some working code next time
here is how i would do it with na.omit (since complete.cases only returns a logical)
(dat.l <- list(dat1 = data.frame(x = 1:2, y = c(1, NA)),
dat2 = data.frame(x = 1:3, y = c(1, NA, 3))))
# $dat1
# x y
# 1 1 1
# 2 2 NA
#
# $dat2
# x y
# 1 1 1
# 2 2 NA
# 3 3 3
Map(na.omit, dat.l)
# $dat1
# x y
# 1 1 1
#
# $dat2
# x y
# 1 1 1
# 3 3 3
Do you mean like the below?
> lst
$a
a
1 1
2 2
3 NA
4 3
5 4
$b
b
1 1
2 NA
3 2
4 3
5 4
$d
d e
1 NA 1
2 NA 2
3 3 3
4 4 NA
5 5 NA
> f <- function(x) x[complete.cases(x),]
> lapply(lst, f)
$a
[1] 1 2 3 4
$b
[1] 1 2 3 4
$d
d e
3 3 3
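Note that in the output above the single-column data frames came back as plain vectors, because [ drops the data frame structure when only one column remains. If data frames are wanted throughout, drop = FALSE is one way to keep them. A small sketch with made-up data similar to lst:

```r
# Made-up input: one single-column and one two-column data frame
lst_example <- list(
  a = data.frame(a = c(1, 2, NA, 3, 4)),
  d = data.frame(d = c(NA, NA, 3, 4, 5), e = c(1, 2, 3, NA, NA))
)

# drop = FALSE keeps the result a data frame even with a single column
keep_complete <- function(x) x[complete.cases(x), , drop = FALSE]
cleaned <- lapply(lst_example, keep_complete)
print(cleaned)
```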
file_name[complete.cases(file_name), ]
complete.cases() returns only a logical vector marking the rows with no NA values. Indexing with it as above should do the job and return only those rows.
Looking through the ave function, I found a remarkable line:
split(x, g) <- lapply(split(x, g), FUN) # From ave
Interestingly, this line changes the value of x, which I found unexpected. I expected that split(x,g) would result in a list, which could be assigned to, but discarded afterward. My question is, why does the value of x change?
Another example may explain better:
a <- data.frame(id=c(1,1,2,2), value=c(4,5,7,6))
# id value
# 1 1 4
# 2 1 5
# 3 2 7
# 4 2 6
split(a,a$id) # Split a row-wise by id into a list of size 2
# $`1`
# id value
# 1 1 4
# 2 1 5
# $`2`
# id value
# 3 2 7
# 4 2 6
# Find the row with highest value for each id
lapply(split(a,a$id),function(x) x[which.max(x$value),])
# $`1`
# id value
# 2 1 5
# $`2`
# id value
# 3 2 7
# Assigning to the split changes the data.frame a!
split(a,a$id)<-lapply(split(a,a$id),function(x) x[which.max(x$value),])
a
# id value
# 1 1 5
# 2 1 5
# 3 2 7
# 4 2 7
Not only has a changed, but it changed to a value that does not look like the right hand side of the assignment! Even if assigning to split(a,a$id) somehow changes a (which I don't understand), why does it result in a data.frame instead of a list?
Note that I understand that there are better ways to accomplish this task. My question is why does split(a,a$id)<-lapply(split(a,a$id),function(x) x[which.max(x$value),]) change a?
The help page for split says in its header: "The replacement forms replace values corresponding to such a division." So it really should not be unexpected, although I admit it is not widely used. I do not understand how your example illustrates that the assigned values "do not look like the RHS of the assignment!". The max values are assigned to the 'value' lists within categories defined by the second argument factor.
(I do thank you for the question. I had not realized that split<- was at the core of ave. I guess it is more widely used than I realized, since I think ave is a wonderfully useful function.)
Just after definition of a, perform split(a, a$id)=1, the result would be:
> a
id value
1 1 1
2 1 1
3 1 1
4 1 1
The key here is that split<- actually modified the LHS with RHS values.
Here's an example:
> x <- c(1,2,3);
> split(x,x==2)
$`FALSE`
[1] 1 3
$`TRUE`
[1] 2
> split(x,x==2) <- split(c(10,20,30),c(10,20,30)==20)
> x
[1] 10 20 30
Note the line where I re-assign split(x,x==2) <- . This actually reassigns x.
As the comments below have stated, you can look up the definition of split<- like so
> `split<-.default`
function (x, f, drop = FALSE, ..., value)
{
ix <- split(seq_along(x), f, drop = drop, ...)
n <- length(value)
j <- 0
for (i in ix) {
j <- j%%n + 1
x[i] <- value[[j]]
}
x
}
<bytecode: 0x1e18ef8>
<environment: namespace:base>
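The general mechanism behind this is how R evaluates replacement calls: an expression of the form f(x, ...) <- value is evaluated roughly as x <- `f<-`(x, ..., value = value). A minimal sketch that calls the replacement function explicitly, reusing the vector example above (the unnamed list passed as value is an assumption made for brevity):

```r
x <- c(1, 2, 3)

# Same effect as: split(x, x == 2) <- list(c(10, 30), 20)
# The first list element fills the FALSE group, the second the TRUE group.
x <- `split<-`(x, x == 2, value = list(c(10, 30), 20))
print(x)
```

Because the replacement function returns the modified x, which is then bound back to the name x, the result keeps the type of x (a vector or a data frame) rather than becoming a list.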