extracting a dataframe from a list over many objects - r

I have over 1,000 objects (z) in R, each containing three dataframes (df1, df2, df3) with different structures.
z1$df1 … z1000$df1
z1$df2 … z1000$df2
z1$df3 … z1000$df3
I created a list of these objects (list1 thus contains z1 through z1000) and tried to use lapply to extract one type of dataframe (df2) from all objects, and then merge them into a single dataframe.
Extraction:
For a single object it would look like this:
df15<- z15$df2 # I transferred the index of z to the extracted df
I tried some code with lapply, ignoring the transfer of the index (I can create another list for that). However I don’t know what function I should use.
list2 <- lapply(list1, function(x)) # incomplete: this is the function body I am missing
I am trying to avoid a loop because there are so many objects and vectorization is so much quicker. I suspect I'm looking at it from the wrong angle.
Subsequent merging can be done as follows:
merged <- do.call(rbind, list2)
Thanks for any suggestions.

It sounds like you want to pull out all the df1s and rbind them together, then do the same for the other dataframes. You can use purrr::map_dfr to extract a named element from each element of the list and row-bind the results together.
library('tidyverse')
dummy_df <- list(
df1 = iris,
df2 = cars,
df3 = CO2)
list1 <- list(
z1 = dummy_df,
z2 = dummy_df,
z3 = dummy_df)
df1 <- map_dfr(list1, 'df1')
df2 <- map_dfr(list1, 'df2')
df3 <- map_dfr(list1, 'df3')
If you wanted to do it in base R, you can use lapply.
df1 <- lapply(list1, function(x) x$df1)
df1_merged <- do.call(rbind, df1)
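The question also mentions carrying the index of z over to the extracted data frame. A minimal sketch of one way to do that, assuming the list is named (z1, z2, ...) and using the `.id` argument of map_dfr; the column name "source" is an arbitrary choice for illustration:

```r
library(purrr)

# Toy stand-ins for the real objects: each z holds three data frames.
z <- list(df1 = iris, df2 = cars, df3 = CO2)
list1 <- list(z1 = z, z2 = z, z3 = z)

# `.id` stores each list name ("z1", "z2", ...) in a new column,
# so every row records which object it came from.
df2_all <- map_dfr(list1, "df2", .id = "source")

head(df2_all, 3)
table(df2_all$source)
```

If the list has no names, `.id` records the position of each element instead.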

One option could be using lapply to extract data.frame and then use bind_rows from dplyr.
library(dplyr)
## The data
df1 <- data.frame(id = c(1:10), name = c(LETTERS[1:10]), stringsAsFactors = FALSE)
df2 <- data.frame(id = 11:20, name = LETTERS[11:20], stringsAsFactors = FALSE)
df3 <- data.frame(id = 21:30, name = LETTERS[15:24], stringsAsFactors = FALSE)
df4 <- data.frame(id = 121:130, name = LETTERS[15:24], stringsAsFactors = FALSE)
z1 <- list(df1 = df1, df2 = df2, df3 = df3)
z2 <- list(df1 = df1, df2 = df2, df3 = df3)
z3 <- list(df1 = df1, df2 = df2, df3 = df3)
z4 <- list(df1 = df1, df2 = df2, df3 = df4) #DFs can contain different data
# z <- list(z1, z2, z3, z4)
# Dynamically populate list z with many list objects (mget already returns a named list)
z <- mget(paste0("z", 1:4))
df1_all <- bind_rows(lapply(z, function(x) x$df1))
df2_all <- bind_rows(lapply(z, function(x) x$df2))
df3_all <- bind_rows(lapply(z, function(x) x$df3))
## Result for df3_all
> tail(df3_all)
## id name
## 35 125 S
## 36 126 T
## 37 127 U
## 38 128 V
## 39 129 W
## 40 130 X
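The three bind_rows lines above differ only in the name of the data frame, so they can be folded into one loop over the names. This is a sketch under the assumption that every object has the same three element names; `make_z`, `df_names` and the "source" column are hypothetical names used only for this example:

```r
library(dplyr)

# Hypothetical toy data mirroring the structure used above.
make_z <- function(offset) {
  list(
    df1 = data.frame(id = 1:2 + offset, name = c("a", "b")),
    df2 = data.frame(id = 3:4 + offset, name = c("c", "d")),
    df3 = data.frame(id = 5:6 + offset, name = c("e", "f"))
  )
}
z <- list(z1 = make_z(0), z2 = make_z(10), z3 = make_z(20))

# Loop over the data-frame names instead of repeating one line per name.
df_names <- c("df1", "df2", "df3")
all_dfs <- lapply(
  setNames(df_names, df_names),
  function(nm) bind_rows(lapply(z, `[[`, nm), .id = "source")
)

all_dfs$df2
```

The result is a named list with one combined data frame per name, each with a "source" column taken from the names of z.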

Try this:
lapply(list1, "[[", "df2")
or if you want to rbind them together:
do.call("rbind", lapply(list1, "[[", "df2"))
The row names in the resulting data frame will identify the origin of each row.
No packages are used.
Note
We can use this input to test the code above. BOD is a built-in data frame:
z <- list(df1 = BOD, df2 = BOD, df3 = BOD)
list1 <- list(z1 = z, z2 = z)
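To illustrate the remark about row names, here is a small sketch using the same BOD test input. It assumes list1 is named; the "source" column is an illustrative name, and the regular expression simply strips the trailing row number that rbind appends:

```r
z <- list(df1 = BOD, df2 = BOD, df3 = BOD)
list1 <- list(z1 = z, z2 = z)

merged <- do.call("rbind", lapply(list1, "[[", "df2"))
rownames(merged)  # names of the form "<object>.<row>"

# Strip the trailing ".<row number>" to recover the object name as a column.
merged$source <- sub("\\.[0-9]+$", "", rownames(merged))
rownames(merged) <- NULL
merged
```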

There's also data.table::rbindlist, which is likely faster than do.call(rbind, lapply(...)) or dplyr::bind_rows.
library(data.table)
rbindlist(lapply(list1, "[[", "df2"))
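rbindlist does not keep row names, so the origin of each row is lost unless it is requested. A minimal sketch, assuming a named list and using the `idcol` argument; the column name "source" is again just an example:

```r
library(data.table)

z <- list(df1 = BOD, df2 = BOD, df3 = BOD)
list1 <- list(z1 = z, z2 = z)

# idcol adds a column holding the list name each row came from.
merged <- rbindlist(lapply(list1, "[[", "df2"), idcol = "source")
merged
```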

Related

Apply function on multiple lists in R

I have four lists each with multiple data frames.
I need to apply the same function on the lists.
How can I do this?
Sample data:
df1 <- data.frame(x = 1:3, y = letters[1:3])
df2 <- data.frame(x = 4:6, y = letters[4:6])
df3 <- data.frame(x = 7:9, y = letters[7:9])
df4 <- data.frame(x = 10:12, y = letters[10:12])
list1 <- list(df1,df2)
list2 <- list(df3,df4)
In my real data I import based on a pattern in the filename and thus my list elements will have the following names (sample data):
names(list1) <- c("./1. Data/df1.csv", "./1. Data/df2.csv")
names(list2) <- c("./1. Data/df3.csv", "./1. Data/df4.csv")
And this is one of the functions I want to run on all lists.
element.name <- function(x) {
all_filenames <- names(x) %>%
basename() %>%
as.list()
names(x) <- all_filenames
names(x) <- gsub("\\.csv", "", names(x))
}
which will give the desired output
names(list1) <- element.name(list1)
names(list1)
[1] "df1" "df2"
I've tried using a for loop but I end up overwriting my output, so I hope some of you can help me out, since I need to run a lot of functions on my lists.
You could create a list of your lists, and then use lapply to apply the function element.name to every list. You can use setNames to avoid problems linked to the assignment of names. You can then use list2env to get your lists back into the global environment.
setNames(list(list1, list2), c('list1', 'list2')) |>
lapply(function(x) setNames(x, element.name(x))) |>
list2env(envir = .GlobalEnv) # without envir, list2env creates a new environment
output
> list1
$df1
x y
1 1 a
2 2 b
3 3 c
$df2
x y
1 4 d
2 5 e
3 6 f
> list2
$df3
x y
1 7 g
2 8 h
3 9 i
$df4
x y
1 10 j
2 11 k
3 12 l
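If writing back to the global environment is not needed, the same idea can be kept as a plain list of lists. The sketch below is one possible variant, not the code from the answer above: `clean_names` and `all_lists` are hypothetical names, and it uses tools::file_path_sans_ext instead of gsub to drop the extension:

```r
# Pure function: returns the list with cleaned names, leaves its input untouched.
clean_names <- function(x) {
  setNames(x, tools::file_path_sans_ext(basename(names(x))))
}

list1 <- list("./1. Data/df1.csv" = data.frame(x = 1:3),
              "./1. Data/df2.csv" = data.frame(x = 4:6))
list2 <- list("./1. Data/df3.csv" = data.frame(x = 7:9),
              "./1. Data/df4.csv" = data.frame(x = 10:12))

# Keep the lists together so every later function is a single lapply call.
all_lists <- list(list1 = list1, list2 = list2)
all_lists <- lapply(all_lists, clean_names)

names(all_lists$list1)
names(all_lists$list2)
```

Because each step returns a new list, nothing is overwritten inside a loop, which is the problem described in the question.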
Here is an approach using data.table::fread
library(data.table)
# create dummy CSVs -------------------------------------------------------
DT1 <- data.frame(x = 1:3, y = letters[1:3])
DT2 <- data.frame(x = 4:6, y = letters[4:6])
DT3 <- data.frame(x = 7:9, y = letters[7:9])
DT4 <- data.frame(x = 10:12, y = letters[10:12])
mapply(write.csv, x = list(DT1, DT2, DT3, DT4), file = list("DT1.csv", "DT2.csv", "DT3.csv", "DT4.csv"), row.names = FALSE)
# read in CSVs ------------------------------------------------------------
csv_paths <- list.files(path = ".", pattern = "\\.csv$") # escape the dot so it matches a literal "."
# might need to split this into different steps due to different csv formats?
DT_list <- setNames(lapply(csv_paths, fread), tools::file_path_sans_ext(basename(csv_paths)))
# apply a function to each data.table -------------------------------------
lapply(DT_list, function(DT){DT[, test := x*2]})
If you want to stick with the given dummy data just merge the lists:
list1 <- list(df1,df2)
list2 <- list(df3,df4)
DT_list <- setNames(c(list1, list2), tools::file_path_sans_ext(basename(csv_paths)))

Coerce specific column to "double" within a dataframe list

Let's say I have a list of dataframes
myList <- list(
df1 = data.frame(A = as.character(sample(10)), B = rep(1:2, 10)),
df2 = data.frame(A = as.character(sample(10)), B = rep(1:2, 10)))
I want to coerce column A in each dataframe to double.
I'm trying:
myList = sapply(myList,simplify = FALSE, function(x){
x$A <- as.double(x$A) })
But this returns only the coerced values, not the data frames with their columns.
I also tried with dplyr and mutate_if, but with no success
We can use lapply with transform in base R
myList2 <- lapply(myList, transform, A = as.double(A))
Or use map with mutate from tidyverse
library(dplyr)
library(purrr)
myList2 <- map(myList, ~ .x %>%
mutate(A = as.double(A)))
The issue in the OP's code is that it is not returning the data i.e. 'x'.
myList2 <- sapply(myList, simplify = FALSE,
function(x){
x$A <- as.double(x$A)
x
})
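The reason is that an R function returns the value of its last expression, and the value of an assignment is its right-hand side. A small sketch with two hypothetical helper functions, `f_assign` and `f_return`, makes the difference visible:

```r
f_assign <- function(x) {
  x$A <- as.double(x$A)  # last expression is an assignment: returns the new column only
}

f_return <- function(x) {
  x$A <- as.double(x$A)
  x                      # last expression is the data frame itself
}

df <- data.frame(A = c("1", "2"), B = 1:2, stringsAsFactors = FALSE)

str(f_assign(df))  # a bare numeric vector
str(f_return(df))  # a data frame with A converted to double
```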

How do I put multiple data frames having different dimensions into a single list

For example, I have three dataframes: df1,df2, and df3 like this:
df1 <- data.frame(x = c(1:5), y = c(11:15))
df2 <- data.frame(x = c(1:9), y = c(11:19), z = c(8:16) )
df3 <- data.frame(x = c(2:5), y = c(11:14), z = c(3:6), g = c(4:7))
As usual, we can manually put them into a single list like this:
mylist <- list(A = df1, B = df2, C= df3)
Problem:
But right now, I have 1075 data frames with different dimensions, and I am stuck on putting them into a single list.
What I have:
Recently, I obtained a file (ICON.RData) that loads the 1075 data frames directly into the R environment. More specifically, this file contains those 1075 data frames plus one data frame describing them, whose first column (named var_name) holds the name of each of the 1075 data frames. So we can easily get a vector of all the data frame names like this:
name <- ICON$var_name
Can anyone help me?
mylist <- mget( ICON$var_name)
will put all the data frames into a list called mylist
Assuming we have the icon.RData file shown in the Note at the end, we can load it into an environment iconEnv and then convert that environment to a list iconList. Alternatively, we can just use iconEnv directly and not bother converting it to a list, in which case we can omit the last two lines of code.
load("icon.RData", envir = iconEnv <- new.env())
iconList <- as.list(iconEnv)
rm(iconEnv) # optional
Note
Input file in reproducible form:
df1 <- data.frame(x = 1:5, y = 11:15)
df2 <- data.frame(x = 1:9, y = 11:19, z = 8:16 )
df3 <- data.frame(x = 2:5, y = 11:14, z = 3:6, g = 4:7)
save(df1, df2, df3, file = "icon.RData")
rm(df1, df2, df3)
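When the environment also holds objects that are not data frames, the list can be filtered after it is built. This is a sketch with a hypothetical environment `env` populated by hand rather than loaded from a file:

```r
# Hypothetical environment standing in for one filled by load().
env <- new.env()
assign("df1", data.frame(x = 1:5, y = 11:15), envir = env)
assign("df2", data.frame(x = 1:9, y = 11:19, z = 8:16), envir = env)
assign("note", "not a data frame", envir = env)

# Collect every object by name, then keep only the data frames.
all_objects <- mget(ls(env), envir = env)
df_list <- Filter(is.data.frame, all_objects)

names(df_list)
sapply(df_list, nrow)
```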

Joining data frames without returning all matching combinations

I have a list of data.frames (in this example only 2):
set.seed(1)
df1 <- data.frame(id = sample(LETTERS,50,replace=T), val = rnorm(50), val1 = rnorm(50), stringsAsFactors = F)
df2 <- data.frame(id = sample(LETTERS,30,replace=T), val = rnorm(30), val2 = rnorm(30), stringsAsFactors = F)
df.list <- list(df1,df2)
I want to join them into a single data.frame only by a subset of the shared column names, in this case by id.
If I use:
library(dplyr)
df <- df.list %>% purrr::reduce(dplyr::inner_join,by="id")
The shared column names that I'm not joining by get the x and y suffixes:
id val.x val1 val.y val2
1 G -0.05612874 0.2914462 2.087167 0.7876396
2 G -0.05612874 0.2914462 -0.255027 1.4411577
3 J -0.15579551 -0.4432919 -1.286301 1.0273924
In reality, for the shared columns that I'm not joining by, it's good enough to take them from a single data.frame in the list, whichever one they exist in for the joined id.
I don't know these shared column names in advance, but that's not difficult to find out:
E.g.:
df.list.colnames <- unlist(lapply(df.list,function(l) colnames(l %>% dplyr::select(-id))))
df.list.colnames <- table(df.list.colnames)
repeating.colnames <- names(df.list.colnames)[which(df.list.colnames > 1)]
Which will then allow me to separate them from the data.frames in the list:
repeating.colnames.df <- do.call(rbind,lapply(df.list,function(r) r %>% dplyr::select(dplyr::all_of(c("id",repeating.colnames))))) %>%
unique()
I can then drop these columns from each data.frame in the list:
for(r in seq_along(df.list)) df.list[[r]] <- df.list[[r]] %>% dplyr::select(-dplyr::all_of(repeating.colnames))
And then join them as above:
df <- df.list %>% purrr::reduce(dplyr::inner_join,by="id")
And now I'm left with adding repeating.colnames.df to that. I don't know of any join in dplyr that won't return all combinations between df and repeating.colnames.df, so it seems that all I can do is apply over each df$id, pick the first match in repeating.colnames.df and join the result with df.
Is there anything less cumbersome for this situation?
If I followed correctly, I think you can handle this by writing a custom function to pass into reduce that identifies the common column names (excluding your joining columns) and excludes those columns from the "second" table in the merge. As reduce works through the list, the function will "accumulate" the unique columns, defaulting to the columns in the "left-most" table.
Something like this:
library(dplyr)
library(purrr)
set.seed(1)
df1 <- data.frame(id = sample(LETTERS,50,replace=T), val = rnorm(50), val1 = rnorm(50), stringsAsFactors = F)
df2 <- data.frame(id = sample(LETTERS,30,replace=T), val = rnorm(30), val2 = rnorm(30), stringsAsFactors = F)
df.list <- list(df1,df2)
fun <- function(df1, df2, by_col = "id"){
df1_names <- names(df1)
df2_names <- names(df2)
dup_cols <- intersect(df1_names[!df1_names %in% by_col], df2_names[!df2_names %in% by_col])
out <- dplyr::inner_join(df1, df2[, !(df2_names %in% dup_cols)], by = by_col)
return(out)
}
df_chase <- df.list %>% reduce(fun,by_col="id")
Created on 2019-01-15 by the reprex package (v0.2.1)
If I compare df_chase to your final solution, I get the same result:
> all.equal(df_chase, df_orig)
[1] TRUE
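To see how the shared columns accumulate from the left-most table as the reduction proceeds, here is a base R sketch of the same idea on three tiny tables. It swaps in Reduce and merge for purrr::reduce and dplyr::inner_join; `join_keep_first` and the toy tables are hypothetical names for illustration:

```r
# Join two tables on `by`, dropping from y any non-key column x already has.
join_keep_first <- function(x, y, by = "id") {
  dup <- setdiff(intersect(names(x), names(y)), by)
  merge(x, y[, !(names(y) %in% dup), drop = FALSE], by = by)
}

a  <- data.frame(id = c("A", "B"), val = 1:2,  v1 = 3:4)
b  <- data.frame(id = c("A", "B"), val = 9:10, v2 = 5:6)
c3 <- data.frame(id = c("A", "B"), val = 0:1,  v1 = 7:8, v3 = 11:12)

res <- Reduce(join_keep_first, list(a, b, c3))
res  # val and v1 come from `a`, the first table that has them
```

The `drop = FALSE` guards against the case where only a single column of y survives the filtering.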
You can just get rid of the duplicate columns from one of the data frames if you say you don't really care about them and simply use base::merge:
set.seed(1)
df1 <- data.frame(id = sample(LETTERS,50,replace=T), val = rnorm(50), val1 = rnorm(50), stringsAsFactors = F)
df2 <- data.frame(id = sample(LETTERS,30,replace=T), val = rnorm(30), val2 = rnorm(30), stringsAsFactors = F)
# match by name rather than by position, so the column order does not matter
duplicates <- names(df2) %in% setdiff(names(df1), "id")
df2 <- df2[, !duplicates, drop = FALSE]
df12 <- merge(df1, df2, by = "id")
head(df12)

Hierarchical data to be arranged

I need to arrange hierarchical data.
Suppose df is given (here I generate df with some fake data):
df1 <-data.frame("Col1" = rep("a",8), "Col2"= c(rep("M",3),rep("N",2), rep("O", 2), rep("P",1)), "Col3" = LETTERS[1:8])
df2 <-data.frame("Col1" = rep("b",13), "Col2"= c(rep("p",4),rep("q",5),rep("r",3),rep("s",1)), "Col3" = LETTERS[1:13])
df <- rbind(df1,df2)
For each value of Col1, I need the counts of its Col2 values, sorted in ascending order.
Finally, what I am looking for is a list of lists:
list a : (1,2,2,3)
list b : (1,3,4,5)
ll <- split(df, df$Col1)
lapply(ll, function(dat){
v <- Filter(function(v) !is.na(v), with(dat, tapply(Col1, Col2, length)))
v[order(v)]
})
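An alternative sketch of the same counting step, using table and sort instead of tapply and Filter. It builds an equivalent df inline so it runs on its own; `counts` is a hypothetical name, and as.character is used so that factor levels from the other group do not show up as zero counts:

```r
df <- data.frame(
  Col1 = rep(c("a", "b"), c(8, 13)),
  Col2 = c(rep(c("M", "N", "O", "P"), c(3, 2, 2, 1)),
           rep(c("p", "q", "r", "s"), c(4, 5, 3, 1)))
)

# Split Col2 by Col1, count each value, and sort the counts in ascending order.
counts <- lapply(split(as.character(df$Col2), df$Col1),
                 function(x) sort(table(x)))
counts
```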

Resources