I have a list of dataframes on which I want to change the name of the 2nd column in each dataframe in the list so that it matches the name of the list item that holds it. The code that I have at the moment is:
my_list <- list(one = data.frame(a <- 1:5, b <- 1:5), two = data.frame(a <- 1:5, b <- 1:5))
my_list <- lapply(seq_along(names(my_list)), function(x) names(my_list[[x]])[2] <- names(my_list)[x])
but my code just replaces the dataframes without me understanding why. Any help would be much appreciated.
I know that I can do this easily with a "for" loop, but I would like to avoid it, hence my question.
setNames can be convenient here:
my_list2 <- lapply(
names(my_list),
function(x) setNames(my_list[[x]], c(names(my_list[[x]])[1], x))
)
Or the same using Map (which I think is easier to read):
my_list2 <- Map(
function(x, n) setNames(x, c(names(x)[1], n)),
my_list, names(my_list)
)
> my_list2
$one
a one
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
$two
a two
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
A problem of names<- is that it returns the name, not the object.
Try this, loop through data.frames, update column name:
# dummy list
my_list <- list(one = data.frame(a = 1:5, b = 1:5), two = data.frame(a = 1:5, b = 1:5))
my_list_updated <-
lapply(names(my_list), function(i){
x <- my_list[[ i ]]
# set 2nd column to a new name
names(x)[2] <- i
# return
x
})
my_list_updated
# [[1]]
# a one
# 1 1 1
# 2 2 2
# 3 3 3
# 4 4 4
# 5 5 5
#
# [[2]]
# a two
# 1 1 1
# 2 2 2
# 3 3 3
# 4 4 4
# 5 5 5
You could to use the super assignment operator:
my_list <- list(one = data.frame(a = 1:5, b = 1:5), two = data.frame(a = 1:5, b = 1:5))
lapply(seq_along(names(my_list)), function(x) names(my_list[[x]])[2] <<- names(my_list)[x] )
my_list
Related
To start: I've seen this post and no, tidyr's unnest doesn't work here. I am doing an lapply where the returning function returns a list with named entries (see example func at the bottom for clarity):
ls <- lapply(x, func)
Now if I look at ls, it is a list of lists, and in the R studio data viewer it appears as having Name, Type, and Value columns.
Now, if I use
df <- bind_rows(ls)
I get exactly what I want, except I then need to bind the dataframe containing x to df. This is the problem, because for each x, func will return a variable number of rows, which means I need to run an equivalent of bind_rows after I have already attached ls to my dataframe.
An example is as below:
func <- function(x){
res <- list()
res$name <- 1:x
res$val <- 1:x
return(res)
}
df <- data.frame(nums <- c(1:3), letters <- c("A", "B", "C"))
ls <- lapply(df$nums, func)
bind_rows(ls) gives:
name val
<int> <int>
1 1 1
2 1 1
3 2 2
4 1 1
5 2 2
6 3 3
and the desired output is:
name val nums letters
<int> <int> <dbl> <chr>
1 1 1 1 A
2 1 1 2 B
3 2 2 2 B
4 1 1 3 C
5 2 2 3 C
6 3 3 3 C
Note that func here creates n rows given x = n. This is not the case for my actual function. func(n) can produce any positive number of rows.
Maybe you're looking for something more "canned", but you could write a function that would produce the desired output like this:
out <- function(data, varname){
l <- lapply(data[[varname]], func)
l <- lapply(1:length(l), function(x)do.call(data.frame, c(l[[x]], zz_obs=x)))
l <- do.call(rbind, l)
data$zz_obs <- 1:nrow(data)
if(!all(data$obs %in% l$obs))warning("Not all rows of data in output\n")
data <- dplyr::full_join(l, data, by="zz_obs")
data[,-which(names(data) == "zz_obs")]
}
out(df, "nums")
# name val nums letters
# 1 1 1 1 A
# 2 1 1 2 B
# 3 2 2 2 B
# 4 1 1 3 C
# 5 2 2 3 C
# 6 3 3 3 C
You can try mapply which is similar to lapply, but allows multiple vectors or lists to be passed to iterate over their values:
library(dplyr)
func <- function(x, y){
res <- list()
res$name <- 1:x
res$val <- 1:x
res$let <- rep(y, x)
return(res)
}
df <- data.frame(nums <- c(1:3), letters <- c("A", "B", "C"))
ls <- mapply(
func,
x = df$nums,
y =df$letters,
SIMPLIFY = FALSE
)
bind_rows(ls)
# A tibble: 6 x 3
# name val let
# <int> <int> <chr>
# 1 1 1 A
# 2 1 1 B
# 3 2 2 B
# 4 1 1 C
# 5 2 2 C
# 6 3 3 C
In the interim, the function I will be using to do this is:
merge_and_flatten <- function(x, y){
for (i in 1:nrow(x)){
y[[i]][names(x)] <- lapply(x[i, ], rep, times = length(y[[i]][[1]]))
}
return(bind_rows(y))
}
This is the cleanest solution I could come up with. Here, x serves and my df, and y serves as ls. It works by reducing the problem to bind_rows: it simply adds elements to ls which contain the columns in x. I absolutely want a cleaner solution, but this works for anyone who needs it.
I am writing a script that loads RData files containing the results of earlier experiments and parses data frames saved in them. I've noticed that, while the names of variables are not consistent , for instance, sometimes symbol is called gene_name or gene_symbol. The order of variables is also different between the different data frames, so I can't just rename them all with colnames(df) <- c('a', 'b', ...)
I'm looking for a way to rename variables based on their name that won't give an error if that variable isn't found. The below is what I want to do, but (ideally) without needing dozens of conditional statements.
if ('gene_name' %in% colnames(df)) {
df <- df %>% dplyr::rename('symbol' = gene_name)
}
In the below example, I'd like to find an elegant way to rename the variable b to D that I can use safely on data frames that lack a variable b
x <- data.frame('a' = c(1,2,3), 'b' = c(4,5,6))
y <- data.frame('a' = c(1,2,3), 'c' = c(4,5,6))
dfs <- list(x,y)
dfs.fixed <- lapply(dfs, function(x) ?????)
Desired result:
dfs.fixed
[[1]]
a D
1 1 4
2 2 5
3 3 6
[[2]]
a c
1 1 4
2 2 5
3 3 6
Try this approach:
STEP 1
A function substituting a list of colnames with another string (both info parameterized):
colnames_rep<-function(df,to_find,to_sub)
{
colnames(df)[which(colnames(df) %in% to_find)]<-to_sub
return(df)
}
STEP 2
Use lapply to apply the function over each data.frame:
lapply(dfs,colnames_rep,to_find=c("b"),to_sub="D")
[[1]]
a D
1 1 4
2 2 5
3 3 6
[[2]]
a c
1 1 4
2 2 5
3 3 6
Thanks to divibisan for the suggestion
We can use rename_at with map
map(dfs, ~ .x %>%
rename_at(b, sub, pattern = "^b$", replacement = "D"))
#[[1]]
# a D
#1 1 4
#2 2 5
#3 3 6
#[[2]]
# a c
#1 1 4
#2 2 5
#3 3 6
Here's an approach that is similar in concept to Terru_theTerror's, but extends it by allowing regular expressions. It might be overkill, but ...
First, we define a simple "map" that maps to the desired name (first string in each vector of the list) from any string (remaining strings in each vector). The function that does the matching accepts an argument of fixed=FALSE, in which case the 2nd and remaining strings can be regular expressions, which gives more power and responsibility.
If using fixed=TRUE (the default), then the map might look like this:
colnamemap <- list(
c("symbol", "gene_name", "gene_symbol"),
c("D", "c", "quux"),
c("bbb", "b", "ccc")
)
where "gene_name" and "gene_symbol" will both be changed to "symbol", etc. If you want to use patterns (fixed=FALSE), however, you should be as specific as possible to preclude mis- or multiple-matches (across columns).
colnamemapptn <- list(
c("symbol", "^gene_(name|symbol)$"),
c("D", "^D$", "^c$", "^quux$"),
c("bbb", "^b$", "^ccc$")
)
The function that does the actual remapping:
fixfunc <- function(df, namemap, fixed = TRUE, ignore.case = FALSE) {
compare <- if (fixed) `%in%` else grepl
downcase <- if (ignore.case) tolower else c
newcn <- cn <- colnames(df)
newnames <- sapply(namemap, `[`, 1L)
matches <- sapply(namemap, function(nmap) {
apply(outer(downcase(nmap[-1]), downcase(cn), Vectorize(compare)), 2, any)
}) # dims: 1=cn; 2=map-to
for (j in seq_len(ncol(matches))) {
if (sum(matches[,j]) > 1) {
warning("rule ", sQuote(newnames[j]), " matches multiple columns: ",
paste(sQuote(cn[ matches[,j] ]), collapse=","))
matches[,j] <- FALSE
}
}
for (i in seq_len(nrow(matches))) {
rowmatches <- sum(matches[i,])
if (rowmatches == 1) {
newcn[i] <- newnames[ matches[i,] ]
} else if (rowmatches > 1) {
warning("column ", sQuote(cn[i]), " matches multiple rules: ",
paste(sQuote(newnames[ matches[i,]]), collapse=","))
matches[i,] <- FALSE
}
}
if (any(matches)) colnames(df) <- newcn
df
}
(You might extend it to ensure unique-ness, using make.names and/or make.unique. There's also ignore.case, not really tested here but easily done, I believe.)
I'm going to extend your sample data by including one that will match multiple patterns resulting in ambiguity:
x <- data.frame('a' = c(1,2,3), 'b' = c(4,5,6))
y <- data.frame('a' = c(1,2,3), 'c' = c(4,5,6))
z <- data.frame('cc' = 1:3, 'ccc' = 2:4)
dfs <- list(x,y,z)
where the third data.frame has two columns that match my third non-pattern vector. When there are multiple matches, I think the safer thing to do is warn about it and change none of them.
This is correct, fixed-strings only:
lapply(dfs, fixfunc, colnamemap, fixed=TRUE)
# [[1]]
# a bbb
# 1 1 4
# 2 2 5
# 3 3 6
# [[2]]
# a D
# 1 1 4
# 2 2 5
# 3 3 6
# [[3]]
# cc bbb
# 1 1 2
# 2 2 3
# 3 3 4
This incorrectly uses the strings as patterns, which causes one of them to warn about multiple matches:
lapply(dfs, fixfunc, colnamemap, fixed=FALSE)
# Warning in FUN(X[[i]], ...) :
# rule 'D' matches multiple columns: 'cc','ccc'
# [[1]]
# a bbb
# 1 1 4
# 2 2 5
# 3 3 6
# [[2]]
# a D
# 1 1 4
# 2 2 5
# 3 3 6
# [[3]]
# cc bbb
# 1 1 2
# 2 2 3
# 3 3 4
A better use of fixed=FALSE, with strict patterns instead:
lapply(dfs, fixfunc, colnamemapptn, fixed=FALSE)
# same output as the first call
I have n number of data.frame i would like to add column to all data.frame
a <- data.frame(1:4,5:8)
b <- data.frame(1:4, 5:8)
test=ls()
for (j in test){
j = cbind(get(j),IssueType=j)
}
Problem that i'm running into is
j = cbind(get(j),IssueType=j)
because it assigns all the data to j instead of a, b.
As commented, it's mostly better to keep related data in a list structure. If you already have the data.frames in your global environment and you want to get them into a list, you can use:
dflist <- Filter(is.data.frame, as.list(.GlobalEnv))
This is from here and makes sure that you only get data.frame objects from your global environment.
You will notice that you now already have a named list:
> dflist
# $a
# X1.4 X5.8
# 1 1 5
# 2 2 6
# 3 3 7
# 4 4 8
#
# $b
# X1.4 X5.8
# 1 1 5
# 2 2 6
# 3 3 7
# 4 4 8
So you can easily select the data you want by typing for example
dflist[["a"]]
If you still want to create extra columns, you could do it like this:
dflist <- Map(function(df, x) {df$IssueType <- x; df}, dflist, names(dflist))
Now, each data.frame in dflist has a new column called IssueType:
> dflist
# $a
# X1.4 X5.8 IssueType
# 1 1 5 a
# 2 2 6 a
# 3 3 7 a
# 4 4 8 a
#
# $b
# X1.4 X5.8 IssueType
# 1 1 5 b
# 2 2 6 b
# 3 3 7 b
# 4 4 8 b
In the future, you can create the data inside a list from the beginning, i.e.
dflist <- list(
a = data.frame(1:4,5:8)
b = data.frame(1:4, 5:8)
)
To create a list of your data.frames do this:
a <- data.frame(1:4,5:8); b <- data.frame(1:4, 5:8); test <- list(a,b)
This allows you to us the lapply function to perform whatever you like to do with each of the dataframes, eg:
out <- lapply(test, function(x) cbind(j))
For most data.frame operations I recommend using the packages dplyr and tidyr.
wooo wooo
here is answer for the issue
helped by #docendo discimus
Created Dataframe
a <- data.frame(1:4,5:8)
b <- data.frame(1:4, 5:8)
Group data.frame into list
dflist <- Filter(is.data.frame, as.list(.GlobalEnv))
Add's extra column
dflist <- Map(function(df, x) {df$IssueType <- x; df}, dflist, names(dflist))
unstinting the data frame
list2env(dflist ,.GlobalEnv)
Suppose I have a list of data.frames:
list <- list(A=data.frame(x=c(1,2),y=c(3,4)), B=data.frame(x=c(1,2),y=c(7,8)))
I want to combine them into one data.frame like this:
data.frame(x=c(1,2,1,2), y=c(3,4,7,8), group=c("A","A","B","B"))
x y group
1 1 3 A
2 2 4 A
3 1 7 B
4 2 8 B
I can do in this way:
add_group_name <- function(df, group) {
df$group <- group
df
}
Reduce(rbind, mapply(add_group_name, list, names(list), SIMPLIFY=FALSE))
But I want to know if it's possible to get the name inside the lapply loop without the use of names(list), just like this:
add_group_name <- function(df) {
df$group <- ? #How to get the name of df in the list here?
}
Reduce(rbind, lapply(list, add_group_name))
I renamed list to listy to remove the clash with the base function. This is a variation on SeƱor O's answer in essence:
do.call(rbind, Map("[<-", listy, TRUE, "group", names(listy) ) )
# x y group
#A.1 1 3 A
#A.2 2 4 A
#B.1 1 7 B
#B.2 2 8 B
This is also very similar to a previous question and answer here: r function/loop to add column and value to multiple dataframes
The inner Map part gives this result:
Map("[<-", listy, TRUE, "group", names(listy) )
#$A
# x y group
#1 1 3 A
#2 2 4 A
#
#$B
# x y group
#1 1 7 B
#2 2 8 B
...which in long form, for explanation's sake, could be written like:
Map(function(data, nms) {data[TRUE,"group"] <- nms; data;}, listy, names(listy) )
As #flodel suggests, you could also use R's built in transform function for updating dataframes, which may be simpler again:
do.call(rbind, Map(transform, listy, group = names(listy)) )
I think a much easier approach is:
> do.call(rbind, lapply(names(list), function(x) data.frame(list[[x]], group = x)))
x y group
1 1 3 A
2 2 4 A
3 1 7 B
4 2 8 B
Using plyr:
ldply(ll)
.id x y
1 A 1 3
2 A 2 4
3 B 1 7
4 B 2 8
Or in 2 steps :
xx <- do.call(rbind,ll)
xx$group <- sub('([A-Z]).*','\\1',rownames(xx))
xx
x y group
A.1 1 3 A
A.2 2 4 A
B.1 1 7 B
B.2 2 8 B
Consider the following:
df <- data.frame(a = 1, b = 2, c = 3)
names(df[1]) <- "d" ## First method
## a b c
##1 1 2 3
names(df)[1] <- "d" ## Second method
## d b c
##1 1 2 3
Both methods didn't return an error, but the first didn't change the column name, while the second did.
I thought it has something to do with the fact that I'm operating only on a subset of df, but why, for example, the following works fine then?
df[1] <- 2
## a b c
##1 2 2 3
What I think is happening is that replacement into a data frame ignores the attributes of the data frame that is drawn from. I am not 100% sure of this, but the following experiments appear to back it up:
df <- data.frame(a = 1:3, b = 5:7)
# a b
# 1 1 5
# 2 2 6
# 3 3 7
df2 <- data.frame(c = 10:12)
# c
# 1 10
# 2 11
# 3 12
df[1] <- df2[1] # in this case `df[1] <- df2` is equivalent
Which produces:
# a b
# 1 10 5
# 2 11 6
# 3 12 7
Notice how the values changed for df, but not the names. Basically the replacement operator `[<-` only replaces the values. This is why the name was not updated. I believe this explains all the issues.
In the scenario:
names(df[2]) <- "x"
You can think of the assignment as follows (this is a simplification, see end of post for more detail):
tmp <- df[2]
# b
# 1 5
# 2 6
# 3 7
names(tmp) <- "x"
# x
# 1 5
# 2 6
# 3 7
df[2] <- tmp # `tmp` has "x" for names, but it is ignored!
# a b
# 1 10 5
# 2 11 6
# 3 12 7
The last step of which is an assignment with `[<-`, which doesn't respect the names attribute of the RHS.
But in the scenario:
names(df)[2] <- "x"
you can think of the assignment as (again, a simplification):
tmp <- names(df)
# [1] "a" "b"
tmp[2] <- "x"
# [1] "a" "x"
names(df) <- tmp
# a x
# 1 10 5
# 2 11 6
# 3 12 7
Notice how we directly assign to names, instead of assigning to df which ignores attributes.
df[2] <- 2
works because we are assigning directly to the values, not the attributes, so there are no problems here.
EDIT: based on some commentary from #AriB.Friedman, here is a more elaborate version of what I think is going on (note I'm omitting the S3 dispatch to `[.data.frame`, etc., for clarity):
Version 1 names(df[2]) <- "x" translates to:
df <- `[<-`(
df, 2,
value=`names<-`( # `names<-` here returns a re-named one column data frame
`[`(df, 2),
value="x"
) )
Version 2 names(df)[2] <- "x" translates to:
df <- `names<-`(
df,
`[<-`(
names(df), 2, "x"
) )
Also, turns out this is "documented" in R Inferno Section 8.2.34 (Thanks #Frank):
right <- wrong <- c(a=1, b=2)
names(wrong[1]) <- 'changed'
wrong
# a b
# 1 2
names(right)[1] <- 'changed'
right
# changed b
# 1 2