R extract matching values from list of data frames

I have a relatively large amount of data stored in a list of data frames with several columns.
For each element of the list I wish to check one column against a reference and if present extract the value held in another column of the same element and place in a new summary matrix.
e.g. with the following example code:
add1 = c("N1","N1","N1")
coords1 = c(1,2,3)
vals1 = c("a","b","c")
extra1 = c("x","y","x")
add2 = c("N2","N2","N2","N2")
coords2 = c(2,3,4,5)
vals2 = c("b","c","d","e")
extra2 = c("z","y","x","x")
add3 = c("N3","N3","N3")
coords3 = c(1,3,5)
vals3 = c("a","c","e")
extra3 = c("z","z","x")
df1 <- data.frame(add1, coords1, vals1, extra1)
df2 <- data.frame(add2, coords2, vals2, extra2)
df3 <- data.frame(add3, coords3, vals3, extra3)
list_all <- list(df1, df2, df3)
# collect the distinct coordinates (2nd column of each data frame)
coordinate.extract <- unique(unlist(lapply(list_all, "[", 2)))
my_matrix <- matrix(0, ncol = length(list_all)
, nrow = (length(coordinate.extract)))
my_matrix_new <- cbind(as.character(coordinate.extract)
, my_matrix)
I would like to end up with:
my_matrix_new =
V1 V2 V3 V4
1  a     a
2  b  b
3  c  c  c
4     d
5     e  e
i.e. the 3rd column of each list element is chosen based on the value of the second column.
I hope this is clear.
Thanks,
Matt

I would use a data.frame, as there are mixed classes. You may try merge with Reduce to get the expected output: select the 2nd and 3rd columns in each list element, rename the 2nd column so that it is the same across all list elements, merge, and, if needed, replace the NA elements with ''.
lst1 <- lapply(list_all, function(x) {names(x)[2] <- 'V1';x[2:3] })
res <- Reduce(function(...) merge(..., by='V1', all=TRUE), lst1)
res[-1] <- lapply(res[-1], as.character)
res[is.na(res)] <- ''
res
#   V1 vals1 vals2 vals3
# 1  1     a           a
# 2  2     b     b
# 3  3     c     c     c
# 4  4           d
# 5  5           e     e
We can change the column names
names(res) <- paste0('V', seq_along(res))
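If a character matrix is preferred over a merged data frame, the lookup can also be sketched with match(). This is only an illustration, not part of the answer above: the data are rebuilt with simplified column names, and all_coords, vals_by_df and summary_matrix are names made up for the example.

```r
# Rebuild the question's example data (column names simplified for this sketch)
list_all <- list(
  data.frame(add = "N1", coords = c(1, 2, 3),    vals = c("a", "b", "c")),
  data.frame(add = "N2", coords = c(2, 3, 4, 5), vals = c("b", "c", "d", "e")),
  data.frame(add = "N3", coords = c(1, 3, 5),    vals = c("a", "c", "e"))
)

# All distinct coordinates (2nd column of every data frame)
all_coords <- sort(unique(unlist(lapply(list_all, `[[`, 2))))

# For each data frame, look up the value (3rd column) for every coordinate;
# match() gives NA where a coordinate is absent, which becomes ""
vals_by_df <- sapply(list_all, function(d) {
  out <- as.character(d[[3]])[match(all_coords, d[[2]])]
  out[is.na(out)] <- ""
  out
})

# First column holds the coordinate, one further column per data frame
summary_matrix <- cbind(as.character(all_coords), vals_by_df)
colnames(summary_matrix) <- paste0("V", seq_len(ncol(summary_matrix)))
summary_matrix
```

Because everything ends up in one matrix, the coordinates are stored as character strings as well, which is the mixed-class issue the answer mentions.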

Related

Apply function on multiple lists in R

I have four lists each with multiple data frames.
I need to apply the same function on the lists.
How can I do this?
Sample data:
df1 <- data.frame(x = 1:3, y = letters[1:3])
df2 <- data.frame(x = 4:6, y = letters[4:6])
df3 <- data.frame(x = 7:9, y = letters[7:9])
df4 <- data.frame(x = 10:12, y = letters[10:12])
list1 <- list(df1,df2)
list2 <- list(df3,df4)
In my real data I import based on a pattern in the filename and thus my list elements will have the following names (sample data):
names(list1) <- c("./1. Data/df1.csv", "./1. Data/df2.csv")
names(list2) <- c("./1. Data/df3.csv", "./1. Data/df4.csv")
And this is one of the functions I want to run on all lists.
element.name <- function(x) {
all_filenames <- names(x) %>%
basename() %>%
as.list()
names(x) <- all_filenames
names(x) <- gsub("\\.csv", "", names(x))
}
which will give the desired output
names(list1) <- element.name(list1)
names(list1)
[1] "df1" "df2"
I've tried using a for loop but I end up overwriting my output, so I hope some of you can help me out, since I need to run a lot of functions on my lists.
You could create a list of your lists, and then use lapply to apply the function element.name to every list. You can use setNames to avoid problems linked to assigning names inside the function. You can then use list2env to write the renamed lists back to the global environment (envir must be given explicitly, otherwise list2env creates a new environment).
setNames(list(list1, list2), c('list1', 'list2')) |>
lapply(function(x) setNames(x, element.name(x))) |>
list2env(envir = .GlobalEnv)
output
> list1
$df1
  x y
1 1 a
2 2 b
3 3 c

$df2
  x y
1 4 d
2 5 e
3 6 f

> list2
$df3
   x y
1  7 g
2  8 h
3  9 i

$df4
   x y
1 10 j
2 11 k
3 12 l
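As an alternative sketch, the lists can stay inside one parent list instead of being written back to the global environment, and the helper can return the whole renamed list rather than only the names. clean_names and all_lists are hypothetical names for this example; the sample data frames are shortened.

```r
# Sample data as in the question (data frames shortened to one column)
list1 <- list(data.frame(x = 1:3), data.frame(x = 4:6))
list2 <- list(data.frame(x = 7:9), data.frame(x = 10:12))
names(list1) <- c("./1. Data/df1.csv", "./1. Data/df2.csv")
names(list2) <- c("./1. Data/df3.csv", "./1. Data/df4.csv")

# Hypothetical helper: strip directory and extension from the element names
# and return the renamed list itself
clean_names <- function(lst) {
  names(lst) <- tools::file_path_sans_ext(basename(names(lst)))
  lst
}

# Keep the lists together in a parent list and apply the helper to each
all_lists <- lapply(list(list1 = list1, list2 = list2), clean_names)
names(all_lists$list1)
```

Further functions can then be chained with additional lapply calls over all_lists, which avoids the overwriting problem of a for loop.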
Here is an approach using data.table::fread
library(data.table)
# create dummy CSVs -------------------------------------------------------
DT1 <- data.frame(x = 1:3, y = letters[1:3])
DT2 <- data.frame(x = 4:6, y = letters[4:6])
DT3 <- data.frame(x = 7:9, y = letters[7:9])
DT4 <- data.frame(x = 10:12, y = letters[10:12])
mapply(write.csv, x = list(DT1, DT2, DT3, DT4), file = list("DT1.csv", "DT2.csv", "DT3.csv", "DT4.csv"), row.names = FALSE)
# read in CSVs ------------------------------------------------------------
csv_paths <- list.files(path = ".", pattern = "\\.csv$")
# might need to split this into different steps due to different csv formats?
DT_list <- setNames(lapply(csv_paths, fread), tools::file_path_sans_ext(basename(csv_paths)))
# apply a function to each data.table -------------------------------------
lapply(DT_list, function(DT){DT[, test := x*2]})
If you want to stick with the given dummy data, just concatenate the lists:
list1 <- list(df1,df2)
list2 <- list(df3,df4)
DT_list <- setNames(c(list1, list2), tools::file_path_sans_ext(basename(csv_paths)))

How to cast a list to a data frame with unequal columns names, base R only

I have read
What is the most efficient way to cast a list as a data frame?
Convert a list to a data frame
I have a list with unequal columns names which I try to convert to a data frame, with NA for the missing entries in the shorter rows. It is easy with tidyverse (for example with bind_rows), but this is for a low level package that should use base R only.
mylist = list(
list(a = 3, b = "anton"),
list(a = 5, b = "bertha"),
list(a = 7, b = "caesar", d = TRUE)
)
# No problem with equal number of columns
do.call(rbind, lapply(mylist[1:2], data.frame))
# The list of my names
unique(unlist(lapply(mylist, names)))
# rbind does not like unequal numbers
do.call(rbind, lapply(mylist, data.frame))
Find out the unique columns in the list, in lapply add the additional columns using setdiff.
cols <- unique(unlist(sapply(mylist, names)))
do.call(rbind, lapply(mylist, function(x) {
x <- data.frame(x)
x[setdiff(cols, names(x))] <- NA
x
}))
# a b d
#1 3 anton NA
#2 5 bertha NA
#3 7 caesar TRUE
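The same idea can be wrapped in a small reusable function. This is a sketch only; rbind_fill is a hypothetical name, not a base R function, and it assumes every list element describes exactly one row.

```r
# Hypothetical helper: row-bind a list of named lists, padding missing
# columns with NA (base R only)
rbind_fill <- function(rows) {
  cols <- unique(unlist(lapply(rows, names)))
  padded <- lapply(rows, function(r) {
    r <- as.data.frame(r, stringsAsFactors = FALSE)
    missing_cols <- setdiff(cols, names(r))
    if (length(missing_cols) > 0) r[missing_cols] <- NA
    r[cols]  # same column order in every piece
  })
  do.call(rbind, padded)
}

mylist <- list(
  list(a = 3, b = "anton"),
  list(a = 5, b = "bertha"),
  list(a = 7, b = "caesar", d = TRUE)
)
res <- rbind_fill(mylist)
res
```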
Use indexes instead of columns and transpose afterwards (note that this answer uses Python with pandas, not R):
import pandas as pd
l1 = [1,1]
l2 = [2,2,2,2]
df = pd.DataFrame([l1,l2], index = ('l1', 'l2'))
df.T
# l1 l2
# 0 1 2
# 1 1 2
# 2 NaN 2
# 3 NaN 2

R add columns based on modifying other columns of dataframes within a list

I would like to add a new column D to the data.frames in a list, containing the first part of column B. But I'm not sure how to address elements within lists down to the column level.
create some data
df1 <- data.frame(A = "hey", B = "wass.7", C = "up")
df2 <- data.frame(A = "how", B = "are.1", C = "you")
dfList <- list(df1,df2)
desired output:
# a new column removing the last part of column B
[[1]]
A B C D
1 hey wass.7 up wass
[[2]]
A B C D
1 how are.1 you are
for each data frame I did this, which worked
df1$D<-sub('\\..*', '', df1$B)
in a function I tried this, which is probably
not correctly addressing the columns and returns
"unexpected symbol..."
dfList <- lapply(rapply(dfList, function(x)
x$D<-sub('\\..*', '', x$B) how = "list"),
as.data.frame)
the lapply(rapply) part is copied from Using gsub in list of dataframes with R
Check this out
lapply(dfList, function(x){
x$D <-sub('\\..*', '', x$B);
x
})
[[1]]
A B C D
1 hey wass.7 up wass
[[2]]
A B C D
1 how are.1 you are
The rapply solution does work. However, you needed a comma before the how argument to resolve the error. Additionally, you will NOT be able to add a new column with it, only replace existing ones. Since rapply is a recursive call, it runs sub on every element of the nested list, that is, across ALL columns of ALL data frames.
Otherwise use a simple lapply as in @JilberUrbina's answer.
df1 <- data.frame(A = "hey", B = "wass.7", C = "up", stringsAsFactors = FALSE)
df2 <- data.frame(A = "how", B = "are.1", C = "you", stringsAsFactors = FALSE)
dfList <- list(df1,df2)
dfList <- lapply(rapply(dfList, function(x)
sub('\\..*', '', x), how = "list"),
as.data.frame)
dfList
# [[1]]
# A B C
# 1 hey wass up
# [[2]]
# A B C
# 1 how are you
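The difference between the two approaches can be checked side by side. The following is a minimal sketch using the question's data; with_d and all_cols are names chosen for the example, and transform is used here as one possible way to add the column inside lapply.

```r
dfList <- list(
  data.frame(A = "hey", B = "wass.7", C = "up", stringsAsFactors = FALSE),
  data.frame(A = "how", B = "are.1", C = "you", stringsAsFactors = FALSE)
)

# lapply: add a new column D derived from B, leaving A, B and C untouched
with_d <- lapply(dfList, function(df) transform(df, D = sub("\\..*", "", B)))

# rapply: the function is applied to every column of every data frame,
# so B itself is overwritten and no new column can be added
all_cols <- lapply(
  rapply(dfList, function(x) sub("\\..*", "", x), how = "list"),
  as.data.frame
)

with_d[[1]]
all_cols[[1]]
```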

R - Change column name using get()

I have several data.frames in my Global Environment that I need to merge. Many of the data.frames have identical column names. I want to append a suffix to each column that marks its originating data.frame. Because I have many data.frames, I wanted to automate the process as in the following example.
df1 <- data.frame(id = 1:5,x = LETTERS[1:5])
df2 <- data.frame(id = 1:5,x = LETTERS[6:10])
obj <- ls()
for(o in obj){
s <- sub('df','',eval(o))
names(get(o))[-1] <- paste0(names(get(o))[-1],'.',s)
}
# Error in get(o) <- `*vtmp*` : could not find function "get<-"
But the individual pieces of the assignment work fine:
names(get(o))[-1]
# [1] "x"
paste0(names(get(o))[-1],'.',s)
# [1] "x.1"
I've used get in a similar way to write each object to a file with write.csv.
for(o in obj){
write.csv(get(o),file = paste0(o,'.csv'),row.names = F)
}
Any ideas why it's not working in the assignment to change the column names?
The error "could not find function get<-" is R telling you that you can't use <- to update a "got" object. You could probably use assign, but this code is already difficult enough to read. The better solution is to use a list.
From your example:
df1 <- data.frame(id = 1:5,x = LETTERS[1:5])
df2 <- data.frame(id = 1:5,x = LETTERS[6:10])
# put your data frames in a list
df_names = ls(pattern = "df[0-9]+")
df_names # make sure these are the objects you want
# [1] "df1" "df2"
df_list = mget(df_names)
# now we can use a simple for loop (or lapply, mapply, etc.)
for(i in seq_along(df_list)) {
names(df_list[[i]])[-1] =
paste(names(df_list[[i]])[-1],
sub('df', '', names(df_list)[i]),
sep = "."
)
}
# and the column names of the data frames in the list have been updated
df_list
# $df1
# id x.1
# 1 1 A
# 2 2 B
# 3 3 C
# 4 4 D
# 5 5 E
#
# $df2
# id x.2
# 1 1 F
# 2 2 G
# 3 3 H
# 4 4 I
# 5 5 J
It's also now easy to merge them:
Reduce(f = merge, x = df_list)
# id x.1 x.2
# 1 1 A F
# 2 2 B G
# 3 3 C H
# 4 4 D I
# 5 5 E J
For more discussion and examples, see How do I make a list of data frames?
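The for loop can also be written with Map, which passes each data frame together with its list name to a helper. This is a sketch under the same assumptions as the example above (object names of the form df1, df2, and a shared first column id); add_suffix and merged are hypothetical names.

```r
df_list <- list(
  df1 = data.frame(id = 1:5, x = LETTERS[1:5]),
  df2 = data.frame(id = 1:5, x = LETTERS[6:10])
)

# Hypothetical helper: append the numeric part of the object name to every
# column except the first (the id column)
add_suffix <- function(df, nm) {
  names(df)[-1] <- paste(names(df)[-1], sub("df", "", nm), sep = ".")
  df
}

# Map walks over the data frames and their names in parallel
df_list <- Map(add_suffix, df_list, names(df_list))

# Merge all data frames on the shared id column
merged <- Reduce(function(a, b) merge(a, b, by = "id"), df_list)
names(merged)
```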
Using setnames from library(data.table) you can do
for(o in obj) {
oldnames = names(get(o))[-1]
newnames = paste0(oldnames, ".new")
setnames(get(o), oldnames, newnames)
}
You can use eval, which evaluates an R expression in a specified environment.
df1 <- data.frame(id = 1:5,x = LETTERS[1:5])
df2 <- data.frame(id = 1:5,x = LETTERS[6:10])
obj <- ls()
for(o in obj) {
s <- sub('df', '', o)
new_name <- paste0(names(get(o))[-1], '.', s)
eval(parse(text = paste0('names(', o, ')[-1] <- ', substitute(new_name))))
}
This modifies df1 and df2 in place; df1, for example, becomes:
id x.1
1 1 A
2 2 B
3 3 C
4 4 D
5 5 E

R - co-locate columns with the same name after merge

Situation
I have two data frames, df1 and df2, with the same column headings
x <- c(1,2,3)
y <- c(3,2,1)
z <- c(3,2,1)
names <- c("id","val1","val2")
df1 <- data.frame(x, y, z)
names(df1) <- names
a <- c(1, 2, 3)
b <- c(1, 2, 3)
c <- c(3, 2, 1)
df2 <- data.frame(a, b, c)
names(df2) <- names
And am performing a merge
#library(dplyr) # not needed for merge
joined_df <- merge(x = df1, y = df2, by = "id", all = TRUE)
This gives me the columns in the joined_df as id, val1.x, val2.x, val1.y, val2.y
Question
Is there a way to co-locate the columns that had the same heading in the original data frames, to give the column order in the joined data frame as id, val1.x, val1.y, val2.x, val2.y?
Note that in my actual data frame I have 115 columns, so I'd like to stay clear of using joined_df <- joined_df[, c(1, 2, 4, 3, 5)] if possible.
Update/Edit: I would also like to maintain the original order of the column headings, so sorting alphabetically is not an option on my actual data (I realise it would work with the example I have given).
My desired output is
id val1.x val1.y val2.x val2.y
1 1 3 1 3 3
2 2 2 2 2 2
3 3 1 3 1 1
Update with solution for general case
The accepted answer solves my issue nicely.
I've adapted the code slightly here to use the original column names, without having to hard-code them in the rep function.
#specify columns used in merge
merge_cols <- c("id")
# identify duplicate columns and remove those used in the 'merge'
dup_cols <- names(df1)
dup_cols <- dup_cols [! dup_cols %in% merge_cols]
# replicate each duplicate column name and append an 'x' and 'y'
dup_cols <- rep(dup_cols, each=2)
var <- c("x", "y")
newnames <- paste(dup_cols, ".", var, sep = "")
# prepend the merge columns and reorder the joined df by those names
newnames <- c(merge_cols, newnames)
joined_df <- joined_df[newnames]
How about something like this
numrep <- rep(1:2, each = 2)
numrep
var <- c("x", "y")
var
newnames <- paste("val", numrep, ".", var, sep = "")
newdf <- cbind(joined_df$id, joined_df[newnames])
names(newdf)[1] <- "id"
Which should give you the dataframe like this
id val1.x val1.y val2.x val2.y
1 1 3 1 3 3
2 2 2 2 2 2
3 3 1 3 1 1
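A variant that needs neither the "val" prefix nor rep can be sketched by interleaving the two suffixed name vectors with rbind. It assumes merge's default suffixes (.x and .y); shared and ordered are names made up for this example.

```r
df1 <- data.frame(id = c(1, 2, 3), val1 = c(3, 2, 1), val2 = c(3, 2, 1))
df2 <- data.frame(id = c(1, 2, 3), val1 = c(1, 2, 3), val2 = c(3, 2, 1))

merge_cols <- "id"
joined_df <- merge(df1, df2, by = merge_cols, all = TRUE)

# Columns present in both inputs (other than the key), in df1's order
shared <- setdiff(intersect(names(df1), names(df2)), merge_cols)

# rbind() stacks the .x and .y names in two rows; as.vector() reads the
# matrix column by column, which interleaves them: a.x, a.y, b.x, b.y, ...
ordered <- c(merge_cols,
             as.vector(rbind(paste0(shared, ".x"), paste0(shared, ".y"))))

# Keep any columns that were not shared at the end
ordered <- c(ordered, setdiff(names(joined_df), ordered))

joined_df <- joined_df[ordered]
names(joined_df)
```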

Resources