Remove column name pattern in multiple dataframes in R

I have >100 dataframes loaded into R with column name prefixes in some but not all columns that I would like to remove. In the below example with 3 dataframes, I would like to remove the pattern x__ in the 3 dataframes but keep all the dataframe names and everything else the same. How could this be done?
df1 <- data.frame(`x__a` = rep(3, 5), `x__b` = seq(1, 5, 1), `x__c` = letters[1:5])
df2 <- data.frame(`d` = rep(5, 5), `x__e` = seq(2, 6, 1), `f` = letters[6:10])
df3 <- data.frame(`x__g` = rep(5, 5), `x__h` = seq(2, 6, 1), `i` = letters[6:10])

You could put the data frames in a list and use an anonymous function with gsub.
lst <- mget(ls(pattern='^df\\d$'))
lapply(lst, \(x) setNames(x, gsub('x__', '', names(x))))
# $df1
# a b c
# 1 3 1 a
# 2 3 2 b
# 3 3 3 c
# 4 3 4 d
# 5 3 5 e
#
# $df2
# d e f
# 1 5 2 f
# 2 5 3 g
# 3 5 4 h
# 4 5 5 i
# 5 5 6 j
#
# $df3
# g h i
# 1 5 2 f
# 2 5 3 g
# 3 5 4 h
# 4 5 5 i
# 5 5 6 j
If you have no use for the list, you can move the modified data frames back into .GlobalEnv using list2env, but I don't recommend it, since it overwrites the existing objects of the same name.
lapply(lst, \(x) setNames(x, gsub('x__', '', names(x)))) |> list2env(.GlobalEnv)
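Note that gsub('x__', ...) removes the pattern wherever it occurs in a column name. If only a leading prefix should go, the pattern can be anchored with ^. Below is a minimal sketch of that variant; the helper name strip_prefix is made up for illustration, and it assumes the prefix contains no regex metacharacters.

```r
# Example data modelled on the question (two of the three data frames)
df1 <- data.frame(x__a = rep(3, 5), x__b = seq(1, 5, 1), x__c = letters[1:5])
df2 <- data.frame(d = rep(5, 5), x__e = seq(2, 6, 1), f = letters[6:10])
lst <- list(df1 = df1, df2 = df2)

# Hypothetical helper: drop `prefix` only when it starts a column name
strip_prefix <- function(x, prefix = "x__") {
  names(x) <- sub(paste0("^", prefix), "", names(x))
  x
}

cleaned <- lapply(lst, strip_prefix)
names(cleaned$df1)
names(cleaned$df2)
```

The list keeps the original data frame names (df1, df2), so the result can still be passed to list2env if that is really wanted.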

Related

How to rbind two dataframes in R when one has more columns than the other [duplicate]

This question already has answers here:
Combine two data frames by rows (rbind) when they have different sets of columns
I want to merge three dataframes together, appending them as rows to the bottom of the previous, but I want their columns to match. Each dataframe has a different number of columns, but they share column names. EX:
Dataframe A    Dataframe B    Dataframe C
A B Y Z        A B C X Y Z    A B C D W X Y Z
# # # #        # # # # # #    # # # # # # # #
In the end, I want them to look like:
Dataframe_Final
A B C D W X Y Z
# #         # #
# # #     # # #
# # # # # # # #
How can I merge these dataframes together in this way? Again, there's no ID for the rows that is unique (ascending, etc) across the dataframes.
Thanks!
A base R option might be Reduce + merge
out <- Reduce(function(x,y) merge(x,y,all = TRUE),list(dfA,dfB,dfC))
out <- out[order(names(out))]
which gives
A B C D W X Y Z
1 1 2 NA NA NA NA 3 4
2 1 2 3 NA NA 4 5 6
3 1 2 3 4 5 6 7 8
Dummy Data
dfA <- data.frame(A = 1, B = 2, Y = 3, Z = 4)
dfB <- data.frame(A = 1, B = 2, C = 3, X = 4, Y = 5, Z = 6)
dfC <- data.frame(A = 1, B = 2, C = 3, D = 4, W = 5, X = 6, Y = 7, Z = 8)
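One caveat with merge(..., all = TRUE): it joins on the shared columns, so rows whose shared columns happen to hold identical values are combined into a single row instead of being stacked. If a plain row-wise append is wanted, a padding approach may be safer. The following is a minimal sketch in base R; the function name rbind_fill is hypothetical (it is not plyr's rbind.fill), and it assumes all inputs are plain data frames without factor columns.

```r
dfA <- data.frame(A = 1, B = 2, Y = 3, Z = 4)
dfB <- data.frame(A = 1, B = 2, C = 3, X = 4, Y = 5, Z = 6)
dfC <- data.frame(A = 1, B = 2, C = 3, D = 4, W = 5, X = 6, Y = 7, Z = 8)

# Hypothetical helper: add missing columns as NA, then stack the rows
rbind_fill <- function(...) {
  dfs <- list(...)
  all_cols <- sort(unique(unlist(lapply(dfs, names))))
  padded <- lapply(dfs, function(d) {
    for (col in setdiff(all_cols, names(d))) {
      d[[col]] <- NA            # column absent in this data frame
    }
    d[all_cols]                 # same column order everywhere
  })
  do.call(rbind, padded)
}

out <- rbind_fill(dfA, dfB, dfC)
out
```

Every input row appears exactly once in the result, regardless of whether values repeat across the data frames.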

Post-processing of full_join output to remove multiplicity

I have two data frames (df1, df2) and performed a full_join on the common column of interest, col1.
library(dplyr)
df1 <- data.frame(col1=c('A','D','C','C','E','E','I'),col2=c(4,7,8,3,2,4,9))
df2 <- data.frame(col1=c('A','A','B','C','C','E','E','I'),col2=c(4,1,6,8,3,2,1,9))
df1 %>% full_join(df2, by = "col1")
# col1 col2.x col2.y
# 1 A 4 4
# 2 A 4 1
# 3 D 7 NA
# 4 C 8 8
# 5 C 8 3
# 6 C 3 8
# 7 C 3 3
# 8 E 2 2
# 9 E 2 1
# 10 E 4 2
# 11 E 4 1
# 12 I 9 9
# 13 B NA 6
As expected, full_join returns every combination of rows that share a col1 value, and I want to avoid this multiplicity. I would like to arrive at the following output. What post-processing approaches do you suggest?
# col1 col2.x col2.y
# 1 A 4 4
# 2 A NA 1
# 3 D 7 NA
# 4 C 8 8
# 5 C 3 3
# 6 E 2 2
# 7 E 4 1
# 8 I 9 9
# 9 B NA 6
More information:
Case 1: I do not need four rows in the output when both input objects hold the same two values:
# 4 C 8 8
# 5 C 8 3
# 6 C 3 8
# 7 C 3 3
instead, I want only two as:
# 4 C 8 8
# 5 C 3 3
Case 2: Similarly, I do not need four rows when the values differ:
# 8 E 2 2
# 9 E 2 1
# 10 E 4 2
# 11 E 4 1
instead, I want only two rows as below:
# 8 E 2 2
# 9 E 4 1
A possible solution in 2 steps using the data.table-package:
0) load package & convert to data.tables
library(data.table)
setDT(df1)
setDT(df2)
1) define helper function
unlistSD <- function(x) {
  l <- length(x)
  ls <- sapply(x, lengths)
  m <- max(ls)                          # longest list entry in this group
  newSD <- vector(mode = "list", length = l)
  for (i in 1:l) {
    u <- unlist(x[[i]])
    lu <- length(u)
    if (lu < m) {
      u <- c(u, rep(NA_real_, m - lu))  # pad shorter columns with NA
    }
    newSD[[i]] <- u
  }
  return(setNames(as.list(newSD), names(x)))
}
2) merge and apply helper function
merge(df1[, .(col2 = list(col2)), by = col1],
      df2[, .(col2 = list(col2)), by = col1],
      by = "col1", all = TRUE
)[, unlistSD(.SD), by = col1]
which gives the following result:
col1 col2.x col2.y
1: A 4 4
2: A NA 1
3: C 8 8
4: C 3 3
5: D 7 NA
6: E 2 2
7: E 4 1
8: I 9 9
9: B NA 6
Another possibility with base R:
unlistDF <- function(d, groupcols) {
  ds <- split(d[, setdiff(names(d), groupcols)], d[,groupcols])
  ls <- lapply(ds, function(x) max(sapply(x, lengths)))
  dl <- lapply(ds, function(x) lapply(as.list(x), unlist))
  du <- Map(function(x, y) {
    lapply(x, function(i) {
      if(length(i) < y) {
        c(i, rep(NA_real_, y - length(i)))
      } else i
    })
  }, x = dl, y = ls)
  ld <- lapply(du, as.data.frame)
  cbind(d[rep(1:nrow(d), ls), groupcols, drop = FALSE],
        do.call(rbind.data.frame, c(ld, make.row.names = FALSE)),
        row.names = NULL)
}
Now you can use this function as follows in combination with merge:
df <- merge(aggregate(col2 ~ col1, df1, as.list),
aggregate(col2 ~ col1, df2, as.list),
by = "col1", all = TRUE)
unlistDF(df, "col1")
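Both answers pair the values by their position within each col1 group. The same idea can be sketched more directly by adding an explicit within-group counter to each data frame and joining on col1 plus that counter. This is only a sketch under that interpretation of the desired output; the column name idx is made up for illustration.

```r
df1 <- data.frame(col1 = c('A','D','C','C','E','E','I'),
                  col2 = c(4,7,8,3,2,4,9))
df2 <- data.frame(col1 = c('A','A','B','C','C','E','E','I'),
                  col2 = c(4,1,6,8,3,2,1,9))

# Number the rows within each col1 group: 1, 2, ... per group
add_idx <- function(d) {
  d$idx <- ave(seq_along(d$col1), d$col1, FUN = seq_along)
  d
}

# Join on the key AND the position, so values are paired one-to-one
out <- merge(add_idx(df1), add_idx(df2), by = c("col1", "idx"), all = TRUE)
out
```

Because each (col1, idx) pair occurs at most once per input, the join cannot multiply rows. The result is sorted by col1 and idx, so the row order differs from the full_join output.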

R - Sum list of matrix with different columns

I have a large list of matrices with different columns, and I would like to sum these matrices, counting 0 wherever a column does not exist in one of them.
If you have used the function rbind.fill from plyr, I would like something similar but with a sum function. Of course I could build a function to do that, but I'm looking for a native function efficiently programmed in Fortran or C because my data is large.
Here is an example.
This is the easy case, where the matrices have the same columns:
aa <- list(
m1 = matrix(c(1,2,3,4,5,6,7,8,9), nrow = 3, dimnames = list(c(1,2,3),c('a','b','c'))),
m2 = matrix(c(1,2,3,4,5,6,7,8,9), nrow = 3, dimnames = list(c(1,2,3),c('a','b','c')))
)
aa
Reduce('+',aa)
Giving the results:
> aa
$m1
a b c
1 1 4 7
2 2 5 8
3 3 6 9
$m2
a b c
1 1 4 7
2 2 5 8
3 3 6 9
> Reduce('+',aa)
a b c
1 2 8 14
2 4 10 16
3 6 12 18
And with my data:
bb <- list(
m1 = matrix(c(1,2,3,7,8,9), nrow = 3, dimnames = list(c(1,2,3),c('a','c'))),
m2 = matrix(c(1,2,3,4,5,6,7,8,9), nrow = 3, dimnames = list(c(1,2,3),c('a','b','c')))
)
bb
Reduce('+',bb)
Here I would like to have b = c(0,0,0) in the first matrix to sum them.
> bb
$m1
a c
1 1 7
2 2 8
3 3 9
$m2
a b c
1 1 4 7
2 2 5 8
3 3 6 9
Many thanks!
Xevi
One option would be
un1 <- sort(unique(unlist(lapply(bb, colnames))))
bb1 <- lapply(bb, function(x) {
  nm1 <- setdiff(un1, colnames(x))
  m1 <- matrix(0, nrow = nrow(x), ncol = length(nm1), dimnames = list(NULL, nm1))
  cbind(x, m1)[, un1]
})
and then use Reduce
Reduce(`+`, bb1)
# a b c
# 1 2 4 14
# 2 4 5 16
# 3 6 6 18
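For a long list, padding every matrix before summing creates many temporary copies. A possible alternative, sketched below, allocates one zero-filled result and adds each matrix into the matching columns by name. It assumes all matrices have the same rows in the same order and that column names are unique within each matrix; the names all_cols and total are illustrative.

```r
bb <- list(
  m1 = matrix(c(1,2,3,7,8,9), nrow = 3,
              dimnames = list(c(1,2,3), c('a','c'))),
  m2 = matrix(c(1,2,3,4,5,6,7,8,9), nrow = 3,
              dimnames = list(c(1,2,3), c('a','b','c')))
)

# Union of all column names across the list
all_cols <- sort(unique(unlist(lapply(bb, colnames))))

# One zero matrix holding the running total
total <- matrix(0, nrow = nrow(bb[[1]]), ncol = length(all_cols),
                dimnames = list(rownames(bb[[1]]), all_cols))

# Add each matrix into the columns it actually has
for (m in bb) {
  total[, colnames(m)] <- total[, colnames(m)] + m
}
total
```

Columns missing from a matrix are simply never touched for that matrix, which is equivalent to adding zero.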

Assign the results of do.call using cbind to data frames

I want to combine multiple sets of two data frames (a & a_1, b & b_1, etc.). Basically, I want to do what this question is asking. I created a list of my two data sets:
# create data
a <- c(1, 2, 3)
b <- c(2, 3, 4)
at0H0 <- data.frame(a, b)
c <- c(1, 2, 3)
d <- c(2, 3, 4)
at0H0_1 <- data.frame(c, d)
e <- c(1, 2, 3)
f <- c(2, 3, 4)
at0H1 <- data.frame(a, b)
g <- c(1, 2, 3)
h <- c(2, 3, 4)
at0H1_1 <- data.frame(c, d)
# create lists of names
names <- list("at0H0", "at0H1")
namesLPC <- list("at0H0_1", "at0H1_1")
# column bind the data frames?
dfList <- list(cbind(names, namesLPC))
do.call(cbind, dfList)
But now I need it to create data frames for each. This do.call function just creates a list of the names of the data frames. Thanks!
(Edited to make reproducible code)
It's not super straight-forward, but with a little editing to a joining function you can get there:
joinfun <- function(x) do.call(cbind, unname(mget(x,inherits=TRUE)))
lapply(Map(c, names, namesLPC), joinfun)
#[[1]]
# a b c d
#1 1 2 1 2
#2 2 3 2 3
#3 3 4 3 4
#
#[[2]]
# a b c d
#1 1 2 1 2
#2 2 3 2 3
#3 3 4 3 4
The Map function pairs up the dataset names as required:
Map(c, names, namesLPC)
#[[1]]
#[1] "at0H0" "at0H0_1"
#
#[[2]]
#[1] "at0H1" "at0H1_1"
The lapply then loops over each part of the above list to mget (multiple-get) each object into a combined list. Like so, for the first part:
unname(mget(c("at0H0","at0H0_1"),inherits=TRUE))
#[[1]]
# a b
#1 1 2
#2 2 3
#3 3 4
#
#[[2]]
# c d
#1 1 2
#2 2 3
#3 3 4
Finally, do.call(cbind, ...) puts this combined list back into a single data.frame:
do.call(cbind, unname(mget(c("at0H0","at0H0_1"),inherits=TRUE)))
# a b c d
#1 1 2 1 2
#2 2 3 2 3
#3 3 4 3 4
I've figured out a way to do it. A few notes: I have 360 data sets to combine, which is why the loop runs over i in 1:360. Each combined data set is named after the corresponding entry of dataNames, a vector holding the names of the data sets.
for (i in 1:360){
  assign(paste(dataNames[i], sep = ""), cbind(names[[i]], namesLPC[[i]]))
}
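As an alternative to creating hundreds of objects with assign, the combined data frames can be kept in one named list. The sketch below fetches the objects by name with mget and pairs them with Map; the vectors base_names and lpc_names are illustrative, and it assumes every "_1" partner has the same number of rows as its base data frame.

```r
# Example data modelled on the question
at0H0   <- data.frame(a = c(1, 2, 3), b = c(2, 3, 4))
at0H0_1 <- data.frame(c = c(1, 2, 3), d = c(2, 3, 4))
at0H1   <- data.frame(a = c(1, 2, 3), b = c(2, 3, 4))
at0H1_1 <- data.frame(c = c(1, 2, 3), d = c(2, 3, 4))

base_names <- c("at0H0", "at0H1")
lpc_names  <- paste0(base_names, "_1")   # partner objects by naming convention

# Fetch both sets of data frames by name, then cbind them pairwise
combined <- Map(cbind,
                mget(base_names, inherits = TRUE),
                mget(lpc_names, inherits = TRUE))

names(combined)
combined$at0H0
```

Each element is then available as combined$at0H0, combined[["at0H1"]], and so on, and the whole set can be processed further with lapply.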

Function for testing if a row in a dataframe doesn't contain a given element

I have a dataframe like
a <- c(2, 3, 4)
b <- c(5, 4, 3)
c <- c(2, 7, 9)
df <- data.frame(a, b, c)
df
# a b c
# 1 2 5 2
# 2 3 4 7
# 3 4 3 9
and I want to get back the rows that do not contain the number 2; in my example these are the second and third rows.
Using rowSums or colSums:
# data
a <- c(2, 3, 4)
b <- c(5, 4, 3)
c <- c(2, 7, 9)
df <- data.frame(a, b, c)
df
# a b c
# 1 2 5 2
# 2 3 4 7
# 3 4 3 9
# get rows with no 2
df[ rowSums(df == 2, na.rm = TRUE) == 0, ]
# a b c
# 2 3 4 7
# 3 4 3 9
# get columns with no 2
df[ , colSums(df == 2, na.rm = TRUE) == 0, drop = FALSE ]
# b
# 1 5
# 2 4
# 3 3
We can also use Reduce with == to get the rows
df[!Reduce(`|`, lapply(df, `==`, 2)),]
# a b c
#2 3 4 7
#3 4 3 9
and any with sapply to select the columns
df[!sapply(df, function(x) any(x== 2))]
# b
#1 5
#2 4
#3 3
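Here is my solution using some set functions. First, where are the positions of the twos? Note that apply over rows returns its results as columns, so each column of is_two corresponds to a row of df.
is_two <- apply(df, 1, is.element, 2)
is_two
[,1] [,2] [,3]
[1,] TRUE FALSE FALSE
[2,] FALSE FALSE FALSE
[3,] TRUE FALSE FALSE
Now, which columns of is_two (that is, which rows of df) are all FALSE?
no_twos <- apply(!is_two, 2, all)
df[no_twos,]
a b c
2 3 4 7
3 4 3 9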
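The row test can also be wrapped in a small reusable function. The sketch below is one way to do that; the name rows_without is hypothetical, and treating NA cells as "not equal to the value" is an assumption, not something the question specifies.

```r
df <- data.frame(a = c(2, 3, 4), b = c(5, 4, 3), c = c(2, 7, 9))

# Hypothetical helper: keep only rows in which no cell equals `value`
rows_without <- function(d, value) {
  hit <- d == value            # logical matrix, one cell per data frame cell
  hit[is.na(hit)] <- FALSE     # assumption: NA cells do not count as a match
  d[!apply(hit, 1, any), , drop = FALSE]
}

res <- rows_without(df, 2)
res
```

Here apply(hit, 1, any) runs over the rows of a logical matrix and returns one value per row of df, so the row/column orientation stays easy to follow.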
