Use object names within a list in lapply/ldply - r

In attempting to answer a question earlier, I ran into a problem that seemed like it should be simple, but I couldn't figure out.
If I have a list of dataframes:
df1 <- data.frame(a=1:3, x=rnorm(3))
df2 <- data.frame(a=1:3, x=rnorm(3))
df3 <- data.frame(a=1:3, x=rnorm(3))
df.list <- list(df1, df2, df3)
That I want to rbind together, I can do the following:
df.all <- ldply(df.list, rbind)
However, I want another column that identifies which data.frame each row came from. I expected to be able to use the deparse(substitute(x)) method (here and elsewhere) to get the name of the relevant data.frame and add a column. This is how I approached it:
fun <- function(x) {
name <- deparse(substitute(x))
x$id <- name
return(x)
}
df.all <- ldply(df.list, fun)
Which returns
a x id
1 1 1.1138062 X[[1L]]
2 2 -0.5742069 X[[1L]]
3 3 0.7546323 X[[1L]]
4 1 1.8358605 X[[2L]]
5 2 0.9107199 X[[2L]]
6 3 0.8313439 X[[2L]]
7 1 0.5827148 X[[3L]]
8 2 -0.9896495 X[[3L]]
9 3 -0.9451503 X[[3L]]
So obviously each element of the list does not contain the name I think it does. Can anyone suggest a way to get what I expected (shown below)?
a x id
1 1 1.1138062 df1
2 2 -0.5742069 df1
3 3 0.7546323 df1
4 1 1.8358605 df2
5 2 0.9107199 df2
6 3 0.8313439 df2
7 1 0.5827148 df3
8 2 -0.9896495 df3
9 3 -0.9451503 df3

Define your list with names and it should give you an .id column with the data.frame name
df.list <- list(df1=df1, df2=df2, df3=df3)
df.all <- ldply(df.list, rbind)
Output:
.id a x
1 df1 1 1.84658809
2 df1 2 -0.01177462
3 df1 3 0.58579469
4 df2 1 -0.64748756
5 df2 2 0.24384614
6 df2 3 0.59012676
7 df3 1 -0.63037679
8 df3 2 -1.17416295
9 df3 3 1.09349618
Then you can know the data.frame name from the column df.all$.id
Edit:
As per #Gary Weissman's comment if you want to generate the names automatically you can do
names(df.list) <- paste0('df',seq_along(df.list)

Using base only, one could try something like:
dd <- lapply(seq_along(df.list), function(x) cbind(df_name = paste0('df',x),df.list[[x]]))
do.call(rbind,dd)

In your definition, df.list does not have names, however, even then the deparse substitute idiom does not appear to work easilty (as lapply calls .Internal(lapply(X, FUN)) -- you would have to look at the source to see if the object name was available and how to get it
Something like
names(df.list) <- paste('df', 1:3, sep = '')
foo <- function(n, .list){
.list[[n]]$id <- n
.list[[n]]
}
a x id
1 1 0.8204213 a
2 2 -0.8881671 a
3 3 1.2880816 a
4 1 -2.2766111 b
5 2 0.3912521 b
6 3 -1.3963381 b
7 1 -1.8057246 c
8 2 0.5862760 c
9 3 0.5605867 c

if you want to use your function, instead of deparse(substitute(x)) use match.call(), and you want the second argument, making sure to convert it to character
name <- as.character(match.call()[[2]])

Related

changing column names of a data frame by changing values - R

Let I have the below data frame.
df.open<-c(1,4,5)
df.close<-c(2,8,3)
df<-data.frame(df.open, df.close)
> df
df.open df.close
1 1 2
2 4 8
3 5 3
I wanto change column names which includes "open" with "a" and column names which includes "close" with "b":
Namely I want to obtain the below data frame:
a b
1 1 2
2 4 8
3 5 3
I have a lot of such data frames. The pre values(here it is "df.") are changing but "open" and "close" are fix.
Thanks a lot.
We can create a function for reuse
f1 <- function(dat) {
names(dat)[grep('open$', names(dat))] <- 'a'
names(dat)[grep('close$', names(dat))] <- 'b'
dat
}
and apply on the data
df <- f1(df)
-output
df
a b
1 1 2
2 4 8
3 5 3
if these datasets are in a list
lst1 <- list(df, df)
lst1 <- lapply(lst1, f1)
Thanks to dear #akrun's insightful suggestion as always we can do it in one go. So we create character vectors in pattern and replacement arguments of str_replace to be able to carry out both operations at once. We can assign character vector of either length one or more to each one of them. In case of the latter the length of both vectors should correspond. More to the point as the documentation says:
References of the form \1, \2, etc will be replaced with the contents
of the respective matched group (created by ())
library(dplyr)
library(stringr)
df %>%
rename_with(~ str_replace(., c(".*\\.open", ".*\\.close"), c("a", "b")))
a b
1 1 2
2 4 8
3 5 3
Another base R option using gsub + match + setNames
setNames(
df,
c("a", "b")[match(
gsub("[^open|close]", "", names(df)),
c("open", "close")
)]
)
gives
a b
1 1 2
2 4 8
3 5 3

merging more than 2 data frames whilst assigning an identifier factor in R

Take this very simple RWE, I want to know what package can be used to automatically assign a factor (preferable the data frame name) when we merge two or more data.frames
I have manually defined the factor in the example below and shown the desired output. But i want to automate it as I have over 100 tables to merge. Note that the headers within each df are constant, only the name itself changes
A <- 1:5
B <- 5:1
df1 <- data.frame(A,B)
A <- 2:6
B <- 6:2
df2 <- data.frame(A,B)
df1$ID <- rep("df1", 5)
df2$ID <- rep("df2", 5)
big_df <- rbind(df1,df2)
Assuming that your data.frame names follow a certain pattern like beginning with "df" followed by numbers and they are not inside a list but simply in your global environment, you can use the following:
library(data.table)
bigdf <- rbindlist(Filter(is.data.frame, mget(ls(pattern = "^df\\d+"))), id = "ID")
Without data.table, you could do it as follows:
lst <- Filter(is.data.frame, mget(ls(pattern = "^df\\d+")))
bigdf <- do.call(rbind, Map(function(df, id) transform(df, ID=id), lst, names(lst)))
Consider the following:
library(dplyr)
cof_df <- bind_rows(df1, df2, .id="ID")
cof_df
ID A B
1 1 1 5
2 1 2 4
3 1 3 3
4 1 4 2
5 1 5 1
6 2 2 6
7 2 3 5
8 2 4 4
9 2 5 3
10 2 6 2
And then:
cof_df$ID <- factor(cof_df$ID,
levels = c(1,2),
labels = paste0("df", unique(cof_df$ID)))
does the recoding.
A similar result can be obtained by naming the arguments in bind_rows, as in
cof_df <- bind_rows(df1=df1, df2=df2, .id="ID")
Another solution will be to use merge:
merged <- merge(df1, df2, all=TRUE, sort =FALSE)
> merged
A B ID
1 1 5 df1
2 2 4 df1
3 3 3 df1
4 4 2 df1
5 5 1 df1
6 2 6 df2
7 3 5 df2
8 4 4 df2
9 5 3 df2
10 6 2 df2

returning from list to data.frame after lapply

I have a very simply question about lapply. I am transitioning from STATA to R and I think there is some very basic concept that I am not getting about looping in R. But I have been reading about it all afternoon and can't figure out a reasonable way to do this very simple thing.
I have three data frames df1, df2, and df3 that all have the same column names, in the same order, etc.
I want to rename their columns all at once.
I put the data frames in a list:
dflist <- list(df1, df2, df3)
What I want the new names to be:
varlist <- c("newname1", "newname2", "newname3")
Write a function that replaces names with those in varlist, and lapply it over the data frames
ChangeNames <- function(x) {
names(x) <- varlist
return(x)
}
dflist <- lapply(dflist, ChangeNames)
So, as far as I understand, R has changed the names of the copies of the data frames that I put in the list, but not the original data frames themselves. I want the data frames themselves to be renamed, not the elements of the list (which are trapped in a list).
Now, I can go
df1 <- as.data.frame(dflist[1])
df2 <- as.data.frame(dflist[2])
df2 <- as.data.frame(dflist[3])
But that seems weird. You need a loop to get back the elements of a loop?
Basically: once you've put some data frames in a list and run your function on them via lapply, how do you get them back out of the list, without starting back at square one?
If you just want to change the names, that isn't too hard in R. Bear in mind that the assignment operator, <-, can be applied in sequence. Hence:
names(df1) <- names(df2) <- names(df3) <- c("newname1", "newname2", "newname3")
I am not sure I understand correctly, do you want to rename the columns of the data frames or the components of the list that contain the data frames?
If it is the first, please always search before asking, the question has been asked here.
So what you can easily do in case you have even more data frames in the list is:
# Creating some sample data first
> dflist <- list(df1 = data.frame(a = 1:3, b = 2:4, c = 3:5),
+ df2 = data.frame(a = 4:6, b = 5:7, c = 6:8),
+ df3 = data.frame(a = 7:9, b = 8:10, c = 9:11))
# See how it looks like
> dflist
$df1
a b c
1 1 2 3
2 2 3 4
3 3 4 5
$df2
a b c
1 4 5 6
2 5 6 7
3 6 7 8
$df3
a b c
1 7 8 9
2 8 9 10
3 9 10 11
# And do the trick
> dflist <- lapply(dflist, setNames, nm = c("newname1", "newname2", "newname3"))
# See how it looks now
> dflist
$df1
newname1 newname2 newname3
1 1 2 3
2 2 3 4
3 3 4 5
$df2
newname1 newname2 newname3
1 4 5 6
2 5 6 7
3 6 7 8
$df3
newname1 newname2 newname3
1 7 8 9
2 8 9 10
3 9 10 11
So the names were changed from a, b and c to newname1, newname2and newname3 for each data frame in the list.
If it is the second, you can do this:
> names(dflist) <- c("newname1", "newname2", "newname3")

Add new column to data.frame through loop in R

I have n number of data.frame i would like to add column to all data.frame
a <- data.frame(1:4,5:8)
b <- data.frame(1:4, 5:8)
test=ls()
for (j in test){
j = cbind(get(j),IssueType=j)
}
Problem that i'm running into is
j = cbind(get(j),IssueType=j)
because it assigns all the data to j instead of a, b.
As commented, it's mostly better to keep related data in a list structure. If you already have the data.frames in your global environment and you want to get them into a list, you can use:
dflist <- Filter(is.data.frame, as.list(.GlobalEnv))
This is from here and makes sure that you only get data.frame objects from your global environment.
You will notice that you now already have a named list:
> dflist
# $a
# X1.4 X5.8
# 1 1 5
# 2 2 6
# 3 3 7
# 4 4 8
#
# $b
# X1.4 X5.8
# 1 1 5
# 2 2 6
# 3 3 7
# 4 4 8
So you can easily select the data you want by typing for example
dflist[["a"]]
If you still want to create extra columns, you could do it like this:
dflist <- Map(function(df, x) {df$IssueType <- x; df}, dflist, names(dflist))
Now, each data.frame in dflist has a new column called IssueType:
> dflist
# $a
# X1.4 X5.8 IssueType
# 1 1 5 a
# 2 2 6 a
# 3 3 7 a
# 4 4 8 a
#
# $b
# X1.4 X5.8 IssueType
# 1 1 5 b
# 2 2 6 b
# 3 3 7 b
# 4 4 8 b
In the future, you can create the data inside a list from the beginning, i.e.
dflist <- list(
a = data.frame(1:4,5:8)
b = data.frame(1:4, 5:8)
)
To create a list of your data.frames do this:
a <- data.frame(1:4,5:8); b <- data.frame(1:4, 5:8); test <- list(a,b)
This allows you to us the lapply function to perform whatever you like to do with each of the dataframes, eg:
out <- lapply(test, function(x) cbind(j))
For most data.frame operations I recommend using the packages dplyr and tidyr.
wooo wooo
here is answer for the issue
helped by #docendo discimus
Created Dataframe
a <- data.frame(1:4,5:8)
b <- data.frame(1:4, 5:8)
Group data.frame into list
dflist <- Filter(is.data.frame, as.list(.GlobalEnv))
Add's extra column
dflist <- Map(function(df, x) {df$IssueType <- x; df}, dflist, names(dflist))
unstinting the data frame
list2env(dflist ,.GlobalEnv)

Replace values in selected columns by passing column name of data.frame into apply() or plyr function

Suppose I have a date.frame like:
df <- data.frame(a=1:5, b=sample(1:5, 5, replace=TRUE), c=5:1)
df
a b c
1 1 4 5
2 2 3 4
3 3 5 3
4 4 2 2
5 5 1 1
and I need to replace all the 5 as NA in column b & c then return to df:
df
a b c
1 1 4 NA
2 2 3 4
3 3 NA 3
4 4 2 2
5 5 1 1
But I want to do a generic apply() function instead of using replace() each by each because there are actually many variables need to be replaced in the real data. Suppose I've defined a variable list:
var <- c("b", "c")
and come up with something like:
df <- within(df, sapply(var, function(x) x <- replace(x, x==5, NA)))
but nothing happens. I was thinking if there is a way to work this out with something similar to the above by passing a variable list of column names from a data.frame into a generic apply / plyr function (or maybe some other completely different ways). Thanks~
You could just do
df[,var][df[,var] == 5] <- NA
df <- data.frame(a=1:5, b=sample(1:5, 5, replace=TRUE), c=5:1)
df
var <- c("b","c")
df[,var] <- sapply(df[,var],function(x) ifelse(x==5,NA,x))
df
I find the ifelse notation easier to understand here, but most Rers would probably use indexing instead.

Resources