How to combine long list of data frames in R - r

I have a long list of data frames (e.g., hundreds) with names d1, d2, d3, ..., d100. I want to combine them in R, something like df <- cbind(d1:d100). Is there an efficient way of combining them without writing out all the names?

You could first pack all your data frames into a list and then cbind them using do.call. Here I am assuming your data frames are called d1, d2, ... and that they all have the same number of rows:
## Sample data:
d1 <- data.frame(A = 1:3, B = 4:6)
d2 <- data.frame(C = 7:9)
d3 <- data.frame(D = 10:12, E = 13:15)
## Put them into a list:
myList <- lapply(1:3, function(ii){get(paste0("d", ii))})
## Combine them into one big data frame:
myDataFrame <- do.call('cbind', myList)
myDataFrame
# A B C D E
# 1 1 4 7 10 13
# 2 2 5 8 11 14
# 3 3 6 9 12 15
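With hundreds of data frames, the names can also be generated and fetched in one call instead of looping over get(). The following is a minimal sketch, assuming the objects really are named d1, d2, ... and all have the same number of rows. The variable names frames and combined are only illustrative, and the sample data is repeated so the block runs on its own.

```r
## Sample data (repeated here so the sketch is self-contained)
d1 <- data.frame(A = 1:3, B = 4:6)
d2 <- data.frame(C = 7:9)
d3 <- data.frame(D = 10:12, E = 13:15)

## Fetch d1..d3 by name from the current environment;
## for 100 data frames this would be paste0("d", 1:100)
frames <- mget(paste0("d", 1:3), envir = environment())

## cbind() needs equal row counts, so check before combining
stopifnot(length(unique(vapply(frames, nrow, integer(1)))) == 1)

## Drop the list names so the columns keep their own names (A, B, C, D, E)
## instead of being prefixed with d1., d3., ...
combined <- do.call(cbind, unname(frames))
```

Without unname(), the multi-column data frames would have their columns prefixed with the list names (for example d1.A), which is rarely what is wanted here.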

Related

rbind data based on matching values in a column

I have several data frames I would like to combine, but I need to get rid of rows that don't have matching values in a column in the other data frames. For example, I want to merge a, b, and c data frames, based on the values in column x.
a <- data.frame(1:5, 5:9)
colnames(a) <- c("x", "y")
b <- data.frame(1:4, 7:10)
colnames(b) <- c("x", "y")
c <- data.frame(1:3, 6:8)
colnames(c) <- c("x", "y")
and have the result be
1 5
2 6
3 7
1 7
2 8
3 9
1 6
2 7
3 8
where the first three rows are from data frame a, the second three rows are from data frame b, and the third three rows are from data frame c, and the rows that didn't have matching values in column x were not included.
We create an index based on intersecting elements of 'x'
v1 <- Reduce(intersect, list(a$x, b$x, c$x))
rbind(a[a$x %in% v1,], b[b$x %in% v1,], c[c$x %in% v1, ])
# x y
#1 1 5
#2 2 6
#3 3 7
#4 1 7
#5 2 8
#6 3 9
#7 1 6
#8 2 7
#9 3 8
If there are many dataset objects, it is better to keep them in a list. The example here uses unrelated object names, but if the names follow a pattern, e.g. df1, df2, ..., df100, it is easy to collect them into a list:
lst1 <- mget(ls(pattern = "^df\\d+$"))
If the object names are all different (xyz, abc, fq12, etc.) but these are the only data.frame objects loaded in the global environment:
lst1 <- Filter(is.data.frame, mget(ls(.GlobalEnv), envir = .GlobalEnv))
Then, get the intersecting elements of the column 'x':
v1 <- Reduce(intersect, lapply(lst1, `[[`, "x"))
Use the intersecting vector to subset the rows of the list elements:
do.call(rbind, lapply(lst1, function(dat) dat[dat$x %in% v1, ]))
Here, we assume the column names are the same across all the datasets
Another option is to do a merge and then unlist
out <- Reduce(function(...) merge(..., by = 'x'), list(a, b, c))
data.frame(x = out$x, y = unlist(out[-1], use.names = FALSE))
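The list-based steps from this answer can be put together in one runnable piece. This is only a sketch under the same assumption that every data frame has the columns x and y; the names c_df, common_x, kept and result are illustrative and not from the original answer.

```r
## Data from the question (c is renamed c_df to avoid masking base::c)
a    <- data.frame(x = 1:5, y = 5:9)
b    <- data.frame(x = 1:4, y = 7:10)
c_df <- data.frame(x = 1:3, y = 6:8)

## Keep the data frames in a list
lst1 <- list(a = a, b = b, c = c_df)

## Values of x that occur in every data frame
common_x <- Reduce(intersect, lapply(lst1, `[[`, "x"))

## Subset each data frame to those values, then stack the pieces
kept   <- lapply(lst1, function(dat) dat[dat$x %in% common_x, , drop = FALSE])
result <- do.call(rbind, kept)

## rbind() on a named list builds row names like "a.1"; reset them
rownames(result) <- NULL
```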

r - Change Variable Column Names in Function

Below is a function called change_names which works, but only on a specific data frame name. In short, I am having issues understanding how to manipulate the assign function so it can handle different data frame names.
The function basically changes the names on columns of files as I read them in a for loop. For example, one file could have a column name 'A' which should be 'X' while another file could have the column name 'D' which should also be named 'X'.
I have tried a few different outlets to actually change original data frame, 'tempPullList', but I need to be able to use the function on a different data frame.
#====example different files====
file1 <- data.frame(A = rep(1:10), Y = rep(c("Yellow","Red","Purpule","Green","Blue"), 2),
Z = rep(c("Drink", "Food"), 5))
file2 <- data.frame(D = rep(1:10), B = rep(c("Brown","Pink","Purpule","Green","Blue"), 2),
Z = rep(c("Drink", "Food"), 5))
file3 <- data.frame(X = rep(1:10), B = rep(c("Brown","Pink","Purpule","Green","Blue"), 2),
C = rep(c("Drink", "Food"), 5))
file_list <- list(file1, file2, file3)
#====Package Bank====
library(data.table)
library(dplyr)
#====Function====
change_names <- function(x){
#a list of columns to be renamed
#through out the files
chgCols <- c("A",
"B",
"C",
"D")
#the names the columns will be changed to
namekey <- c(A = "X",
B = "Y",
C = "Z",
D = "X")
chgCols <- match(chgCols, colnames(x)) #find any unwanted column indexes in data frame
chgCols <- colnames(x[, chgCols[!is.na(chgCols)]]) #match indexes to column names w/o NA's
x <- x %>% #rename associated columns
plyr::rename(namekey[chgCols]) #from 'namekey' in dataframe
assign('tempPullList', x, envir = .GlobalEnv)
}
#====Read in Files====
PullList <- data.frame()
for(file in 1:length(file_list)){
tempPullList <- data.frame(file_list[file])
print(file)
change_names(x = tempPullList)
PullList <- rbindlist(list(PullList, tempPullList),
fill = T)
}
Again, right now I am only able to do it when the data frame is called 'tempPullList'; I need to be able to do it with another data frame.
I am pretty new to writing functions, and especially to assigning variables within functions. I would like this function to be as flexible as possible. I am currently working on making chgCols and namekey inputs, so any advice on that would also be helpful.
Example data:
column_name_lookup <- data.frame(orig = c("a","b","c","d"),
new = c("X","Y","z","X"),
stringsAsFactors = FALSE)
test_df <- data.frame(a = 1:5,
c = 2:6,
b = 3:7,
e = 4:8,
d = 5:9)
a c b e d
1 1 2 3 4 5
2 2 3 4 5 6
3 3 4 5 6 7
4 4 5 6 7 8
5 5 6 7 8 9
Code to change names:
new_names <- column_name_lookup$new[match(names(test_df),column_name_lookup$orig)]
names(test_df) <- ifelse(is.na(new_names),names(test_df),new_names)
X z Y e X
1 1 2 3 4 5
2 2 3 4 5 6
3 3 4 5 6 7
4 4 5 6 7 8
5 5 6 7 8 9
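The same lookup idea can be wrapped in a function that returns the renamed data frame instead of calling assign(), which is what makes it independent of any particular variable name. The sketch below is an assumption about how that could look, not the answerer's code: rename_from_lookup is a hypothetical helper, the lookup uses the upper-case names from the question, and the three files are shortened stand-ins.

```r
## Hypothetical lookup table: original column name -> desired name
lookup <- data.frame(orig = c("A", "B", "C", "D"),
                     new  = c("X", "Y", "Z", "X"),
                     stringsAsFactors = FALSE)

## Returns a renamed copy of df; names not in the lookup are kept as-is
rename_from_lookup <- function(df, lookup) {
  new_names <- lookup$new[match(names(df), lookup$orig)]
  names(df) <- ifelse(is.na(new_names), names(df), new_names)
  df
}

## Shortened stand-ins for the files in the question
file1 <- data.frame(A = 1:2, Y = c("Red", "Blue"), Z = c("Drink", "Food"),
                    stringsAsFactors = FALSE)
file2 <- data.frame(D = 3:4, B = c("Pink", "Green"), Z = c("Drink", "Food"),
                    stringsAsFactors = FALSE)
file3 <- data.frame(X = 5:6, B = c("Brown", "Blue"), C = c("Food", "Drink"),
                    stringsAsFactors = FALSE)
file_list <- list(file1, file2, file3)

## Rename every file, then stack them; no assign() or global state needed
renamed  <- lapply(file_list, rename_from_lookup, lookup = lookup)
PullList <- do.call(rbind, renamed)
```

Because the function takes the lookup as an argument, chgCols and namekey effectively become inputs. If the files can end up with different sets of columns, data.table::rbindlist(renamed, fill = TRUE) could replace the do.call(rbind, ...) step, as in the question.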

Sum of elements based on unique names across a list(unknown length) of data frames [duplicate]

I am trying to get the sum of elements based on unique names across a list containing unknown number of dataframes.
## Test Data
Name1 <- c("A","B","C","D")
Name2 <- c("A","D")
Name3 <- c("B","C","F")
Values1 <- c(1,2,3,4)
Values2 <- c(5,7)
Values3 <- c(6,8,9)
DF1 <- data.frame(Name1,Values1,stringsAsFactors = FALSE)
DF2 <- data.frame(Name2,Values2,stringsAsFactors = FALSE)
DF3 <- data.frame(Name3,Values3,stringsAsFactors = FALSE)
DFList <- list(DF1,DF2,DF3)
My Output will be:
A B C D F
6 8 11 11 9
I am not sure if using a loop is effective, since there can be any number of dataframes in the list and the number of unique rows in a dataframe can range anywhere between 100,000 to 1 Million.
Solution using data.table::rbindlist:
data.table::rbindlist(DFList)[, sum(Values1), Name1]
Name1 V1
1: A 6
2: B 8
3: C 11
4: D 11
5: F 9
Here rbindlist binds the columns by position, ignoring their differing names, so the result keeps the names of the first data frame and you can then sum(Values1) by Name1.
sapply(split(unlist(lapply(DFList, "[[", 2)), unlist(lapply(DFList, "[[", 1))), sum)
# A B C D F
# 6 8 11 11 9
OR
aggregate(formula = Value~Name,
data = do.call(rbind, lapply(DFList, function(x) setNames(x, c("Name", "Value")))),
FUN = sum)
# Name Value
#1 A 6
#2 B 8
#3 C 11
#4 D 11
#5 F 9
Similar to the answer of @d.b:
lst <- unlist(lapply(DFList, function(DF) setNames(DF[[2]], DF[[1]])))
tapply(lst, names(lst), sum)
#A B C D F
#6 8 11 11 9
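Since the question mentions up to a million rows per data frame, another base R option worth trying is rowsum(), which sums values by group in a single call. This is a sketch of a different technique from the answers above, offered without any benchmark; all_names, all_values and totals are illustrative names.

```r
## Test data from the question
DF1 <- data.frame(Name1 = c("A", "B", "C", "D"), Values1 = c(1, 2, 3, 4),
                  stringsAsFactors = FALSE)
DF2 <- data.frame(Name2 = c("A", "D"), Values2 = c(5, 7),
                  stringsAsFactors = FALSE)
DF3 <- data.frame(Name3 = c("B", "C", "F"), Values3 = c(6, 8, 9),
                  stringsAsFactors = FALSE)
DFList <- list(DF1, DF2, DF3)

## Flatten the first column (names) and second column (values) of every frame
all_names  <- unlist(lapply(DFList, `[[`, 1))
all_values <- unlist(lapply(DFList, `[[`, 2))

## rowsum() returns a one-column matrix with one row per unique name;
## taking that column gives a named vector
totals <- rowsum(all_values, all_names)[, 1]
```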

returning from list to data.frame after lapply

I have a very simple question about lapply. I am transitioning from Stata to R and I think there is some very basic concept that I am not getting about looping in R. But I have been reading about it all afternoon and can't figure out a reasonable way to do this very simple thing.
I have three data frames df1, df2, and df3 that all have the same column names, in the same order, etc.
I want to rename their columns all at once.
I put the data frames in a list:
dflist <- list(df1, df2, df3)
What I want the new names to be:
varlist <- c("newname1", "newname2", "newname3")
Write a function that replaces names with those in varlist, and lapply it over the data frames
ChangeNames <- function(x) {
names(x) <- varlist
return(x)
}
dflist <- lapply(dflist, ChangeNames)
So, as far as I understand, R has changed the names of the copies of the data frames that I put in the list, but not the original data frames themselves. I want the data frames themselves to be renamed, not the elements of the list (which are trapped in a list).
Now, I can go
df1 <- as.data.frame(dflist[1])
df2 <- as.data.frame(dflist[2])
df3 <- as.data.frame(dflist[3])
But that seems weird. You need a loop to get back the elements of a loop?
Basically: once you've put some data frames in a list and run your function on them via lapply, how do you get them back out of the list, without starting back at square one?
If you just want to change the names, that isn't too hard in R. Bear in mind that the assignment operator, <-, can be applied in sequence. Hence:
names(df1) <- names(df2) <- names(df3) <- c("newname1", "newname2", "newname3")
I am not sure I understand correctly, do you want to rename the columns of the data frames or the components of the list that contain the data frames?
If it is the first, please always search before asking, the question has been asked here.
So what you can easily do in case you have even more data frames in the list is:
# Creating some sample data first
> dflist <- list(df1 = data.frame(a = 1:3, b = 2:4, c = 3:5),
+ df2 = data.frame(a = 4:6, b = 5:7, c = 6:8),
+ df3 = data.frame(a = 7:9, b = 8:10, c = 9:11))
# See how it looks like
> dflist
$df1
a b c
1 1 2 3
2 2 3 4
3 3 4 5
$df2
a b c
1 4 5 6
2 5 6 7
3 6 7 8
$df3
a b c
1 7 8 9
2 8 9 10
3 9 10 11
# And do the trick
> dflist <- lapply(dflist, setNames, nm = c("newname1", "newname2", "newname3"))
# See how it looks now
> dflist
$df1
newname1 newname2 newname3
1 1 2 3
2 2 3 4
3 3 4 5
$df2
newname1 newname2 newname3
1 4 5 6
2 5 6 7
3 6 7 8
$df3
newname1 newname2 newname3
1 7 8 9
2 8 9 10
3 9 10 11
So the names were changed from a, b and c to newname1, newname2 and newname3 for each data frame in the list.
If it is the second, you can do this:
> names(dflist) <- c("newname1", "newname2", "newname3")
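On the original question of getting the data frames back out of the list: one possible route, sketched here under the assumption that the list elements are named after the variables they came from, is list2env(). The sample data is made up for the illustration.

```r
df1 <- data.frame(a = 1:3, b = 2:4, c = 3:5)
df2 <- data.frame(a = 4:6, b = 5:7, c = 6:8)
varlist <- c("newname1", "newname2", "newname3")

## The list must be named after the objects it should overwrite
dflist <- list(df1 = df1, df2 = df2)
dflist <- lapply(dflist, setNames, nm = varlist)

## Copy each list element back to a variable of the same name.
## environment() is the current environment (the global one at top level).
invisible(list2env(dflist, envir = environment()))

names(df1)  # now newname1, newname2, newname3
```

In many cases it is simpler to leave the data frames in the list and keep working with lapply(), pulling out single elements with dflist$df1 or dflist[["df1"]] when needed.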

Using lapply to change column names of a list of data frames

I'm trying to use lapply on a list of data frames; but failing at passing the parameters correctly (I think).
List of data frames:
df1 <- data.frame(A = 1:10, B= 11:20)
df2 <- data.frame(A = 21:30, B = 31:40)
listDF <- list(df1, df2,df3) #multiple data frames w. way less columns than the length of vector todos
Vector with columns names:
todos <-c('col1','col2', ......'colN')
I'd like to change the column names using lapply:
lapply (listDF, function(x) { colnames(x)[2:length(x)] <-todos[1:length(x)-1] } )
but this doesn't change the names at all. Am I not passing the data frames themselves, but something else? I just want to change names, not to return the result to a new object.
Thanks in advance, p.
You can also use setNames if you want to replace all columns
df1 <- data.frame(A = 1:10, B= 11:20)
df2 <- data.frame(A = 21:30, B = 31:40)
listDF <- list(df1, df2)
new_col_name <- c("C", "D")
lapply(listDF, setNames, nm = new_col_name)
## [[1]]
## C D
## 1 1 11
## 2 2 12
## 3 3 13
## 4 4 14
## 5 5 15
## 6 6 16
## 7 7 17
## 8 8 18
## 9 9 19
## 10 10 20
## [[2]]
## C D
## 1 21 31
## 2 22 32
## 3 23 33
## 4 24 34
## 5 25 35
## 6 26 36
## 7 27 37
## 8 28 38
## 9 29 39
## 10 30 40
If you need to replace only a subset of column names, then you can use the solution of @Jogo
lapply(listDF, function(df) {
names(df)[-1] <- new_col_name[-ncol(df)]
df
})
A last point, in R there is a difference between a:b - 1 and a:(b - 1)
1:10 - 1
## [1] 0 1 2 3 4 5 6 7 8 9
1:(10 - 1)
## [1] 1 2 3 4 5 6 7 8 9
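The precedence point matters for indexing because a zero index is silently dropped, which can hide the mistake. A small sketch with made-up values (todos and n here are only for illustration):

```r
todos <- c("col1", "col2", "col3", "col4")
n <- 3

## ":" binds tighter than "-", so the 1 is subtracted from every element
idx_unparenthesised <- 2:n - 1    # (2:n) - 1, i.e. 1 2
idx_parenthesised   <- 2:(n - 1)  # 2:2, i.e. 2

todos[idx_unparenthesised]  # "col1" "col2"
todos[idx_parenthesised]    # "col2"

## With a start of 1 the unparenthesised form yields indices 0 1 2;
## the 0 is dropped, so the result looks right by accident
zero_dropped <- todos[1:n - 1]  # "col1" "col2"
```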
EDIT
If you want to change the column names of the data frames in the global environment from a list, you can use list2env, but I'm not sure it is the best way to achieve what you want. You also need to use a named list, where each name matches the name of the data frame it should replace.
listDF <- list(df1 = df1, df2 = df2)
new_col_name <- c("C", "D")
listDF <- lapply(listDF, function(df) {
names(df)[-1] <- new_col_name[-ncol(df)]
df
})
list2env(listDF, envir = .GlobalEnv)
str(df1)
## 'data.frame': 10 obs. of 2 variables:
## $ A: int 1 2 3 4 5 6 7 8 9 10
## $ C: int 11 12 13 14 15 16 17 18 19 20
Try this:
lapply (listDF, function(x) {
names(x)[-1] <- todos[-length(x)]
x
})
You will get a new list with the changed data frames. If you want to manipulate listDF directly:
for (i in 1:length(listDF)) names(listDF[[i]])[-1] <- todos[-length(listDF[[i]])]
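The reason the lapply call in the question appears to do nothing can be seen by comparing the two versions side by side. This sketch uses small made-up data frames; broken and fixed are illustrative names.

```r
listDF <- list(data.frame(A = 1:3, B = 4:6),
               data.frame(A = 7:9, B = 10:12))
todos <- c("col1", "col2")

## The last expression is an assignment, so the function returns the
## assigned value (the new names), not the modified data frame
broken <- lapply(listDF, function(x) {
  names(x)[-1] <- todos[-length(x)]
})

## Returning x explicitly gives back the renamed data frame
fixed <- lapply(listDF, function(x) {
  names(x)[-1] <- todos[-length(x)]
  x
})

broken[[1]]         # the character value "col1"
names(fixed[[1]])   # "A" "col1"
names(listDF[[1]])  # still "A" "B": lapply works on copies
```

In both cases listDF itself is untouched until the result is assigned back, for example with listDF <- fixed.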
I was not able to get the code used in these answers to work. I found some code from another forum which did work. This assigns the new column names to each data frame itself, whereas the other methods created copies of the data frames. For anyone else, here is the code.
# Create some dataframes
df1 <- data.frame(A = 1:10, B= 11:20)
df2 <- data.frame(A = 21:30, B = 31:40)
listDF <- c("df1", "df2") #Notice this is NOT a list
new_col_name <- c("C", "D") #What do you want the new columns to be named?
# Assign the new column names to each dataframe in "listDF"
for(df in listDF) {
df.tmp <- get(df)
names(df.tmp) <- new_col_name
assign(df, df.tmp)
}
