R remove common columns in dataframes

I have 2 dfs (simplified example):
df1 a b c g ...
1 0 0 0
2 0 0 1
And
df2 a b d e f ...
1 1 0 0 0
2 0 0 0 1
I would like to merge the 2 dfs but before joining I would like to remove common columns in df1 and df2. So I would retain columns (c,d,e,f,g) as a and b are common in df1 and df2.
So basically doing the opposite of what was answered here:
delete columns in data frame not in common with another (R)

Using the set operations union, intersect and setdiff on the names of both data frames, we can do this:
df1 <- read.table(header = T, text = 'a b c g
1 0 0 0
2 0 0 1')
df2 <- read.table(header = T, text = 'a b d e f
1 1 0 0 0
2 0 0 0 1')
# uncommon column names
x <- setdiff(union(names(df1), names(df2)), intersect(names(df1), names(df2)))
cbind(df1[names(df1) %in% x], df2[names(df2) %in% x])
#> c g d e f
#> 1 0 0 0 0 0
#> 2 0 1 0 0 1
Created on 2021-06-15 by the reprex package (v2.0.0)
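The same idea can be packaged as a small helper. This is only a sketch: the function name drop_common_cols is hypothetical and not part of the original answer, and it assumes both data frames have the same number of rows so that cbind can align them.

```r
# Hypothetical helper: keep only the columns whose names are NOT shared.
drop_common_cols <- function(d1, d2) {
  common <- intersect(names(d1), names(d2))
  # Select by name, so nothing breaks when there are no shared columns
  cbind(d1[setdiff(names(d1), common)], d2[setdiff(names(d2), common)])
}

df1 <- read.table(header = TRUE, text = "a b c g
1 0 0 0
2 0 0 1")
df2 <- read.table(header = TRUE, text = "a b d e f
1 1 0 0 0
2 0 0 0 1")

res <- drop_common_cols(df1, df2)
names(res)
```

Selecting by name rather than by position keeps the column order of each input and avoids the need for a separate union step.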

In base R, you can start by using the duplicated function to work out which column names both data frames have in common. From there, it's just a matter of selecting and binding the columns from each data frame that are not on this list.
all_names <- c(names(df1), names(df2))
dupes <- all_names[duplicated(all_names)]
# logical indexing stays correct when nothing matches; -which() would then drop every column
df3 <- cbind(df1[!names(df1) %in% dupes], df2[!names(df2) %in% dupes])
Following your example, this would produce the following data frame, consisting only of the unique columns from each of the others. This is based on the assumption that both data frames have the same number of rows.
df3 c g d e f ...
0 0 0 0 0
0 1 0 0 1
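Since the approach relies on both data frames having the same number of rows, it may be worth checking that assumption explicitly before binding. A minimal sketch follows; the helper name check_same_rows and the small example data frames are made up for illustration.

```r
# Hypothetical guard: stop early if the row counts do not match.
check_same_rows <- function(d1, d2) {
  if (nrow(d1) != nrow(d2)) {
    stop("row counts differ: ", nrow(d1), " vs ", nrow(d2))
  }
  invisible(TRUE)
}

# Illustrative data with mismatched row counts
df_small <- data.frame(a = 1:2, c = 0)
df_large <- data.frame(a = 1:3, d = 0)

# Capture the error message instead of aborting the script
outcome <- tryCatch(
  check_same_rows(df_small, df_large),
  error = function(e) conditionMessage(e)
)
outcome
```

If the rows of the two data frames are not aligned one to one, a keyed merge() on an identifier column is probably a better fit than cbind.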

Related

Transforming dataset

I have a dataset that looks like the one below, and I want to transform it into another format that assigns true/false based on whether a certain string is present. What's the best way to do this in either Excel or R?
Thanks!
Initial dataset:
Row1 A D
Row2 B C
Row3 A C E
The format I want:
A B C D E
Row1 1 0 0 1 0
Row2 0 1 1 0 0
Row3 1 0 1 0 1
Here is a base R way with lapply and xtabs.
I assume that filename holds the data file name.
x <- readLines(filename)
x <- strsplit(x, " ")
l <- lapply(x, \(y) {
values <- y[-1]
rows <- rep(y[1], length(values))
data.frame(rows, values)
})
df1 <- do.call(rbind, l)
rm(x, l)
xtabs(~ rows + values, df1)
#> values
#> rows A B C D E
#> Row1 1 0 0 1 0
#> Row2 0 1 1 0 0
#> Row3 1 0 1 0 1
Created on 2022-09-08 by the reprex package (v2.0.1)
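If the lines are already in memory, a membership matrix can also be built directly, without the intermediate long data frame. This is a sketch under the assumption that fields are separated by single spaces and that the first field is the row label; the variable names are illustrative.

```r
# Example input, one string per row (stands in for readLines(filename))
lines <- c("Row1 A D", "Row2 B C", "Row3 A C E")

parts   <- strsplit(lines, " ", fixed = TRUE)
row_ids <- vapply(parts, `[`, character(1), 1)  # first field of each line
values  <- lapply(parts, `[`, -1)               # remaining fields
lvls    <- sort(unique(unlist(values)))         # all distinct values

# One 0/1 vector per row, marking which values are present
m <- t(vapply(values, function(v) as.integer(lvls %in% v),
              integer(length(lvls))))
dimnames(m) <- list(row_ids, lvls)
m
```

Unlike xtabs, this records presence only, so a value repeated within a row still yields 1 rather than a count.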

Adding a row to a data frame when the row contains more elements than there are columns in the data frame

I would like to add a new row to a data frame; however, this row in my case has more values. Let's assume I have the following dataset:
> df <- data.frame(0,0,0)
> colnames(df) <- c("A","B","C")
> df
A B C
0 0 0
Now let us have a vector with 4 elements.
> x <- c(0,0,0,0)
> names(x) <- c("A","B","D","C")
> x
A B D C
0 0 0 0
I would like to add this vector to the data frame above such that
> df
A B C D
0 0 0 NA
0 0 0 0
Is there a way to do this?
Using rbindlist
library(data.table)
rbindlist(list(df, as.data.frame.list(x)), fill = TRUE)
A B C D
1: 0 0 0 NA
2: 0 0 0 0
You may use -
library(dplyr)
df %>% bind_rows(x %>%
t %>%
as.data.frame())
# A B C D
#1 0 0 0 NA
#2 0 0 0 0
Or as @Andrew Gustar mentioned -
dplyr::bind_rows(df, x)
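For completeness, here is a base R sketch of the same fill behaviour without extra packages. It assumes the vector is fully named; the names new_row, all_cols and result are illustrative only.

```r
df <- data.frame(A = 0, B = 0, C = 0)
x  <- c(A = 0, B = 0, D = 0, C = 0)

new_row  <- as.data.frame(as.list(x))          # one-row data frame from the named vector
all_cols <- union(names(df), names(new_row))   # A, B, C, D

# Add any column that is missing on either side, filled with NA
for (col in setdiff(all_cols, names(df)))      df[[col]] <- NA
for (col in setdiff(all_cols, names(new_row))) new_row[[col]] <- NA

# Put both in the same column order before stacking
result <- rbind(df[all_cols], new_row[all_cols])
result
```

The explicit loops are a no-op when nothing is missing on one side, which keeps the sketch safe for inputs that already share all columns.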

Add X number of columns to a data.frame

I would like to add a varying number (X) of columns with 0 to an existing data.frame within a function.
Here is an example data.frame:
dt <- data.frame(x=1:3, y=4:6)
I would like to get this result if X=1 :
a x y
1 0 1 4
2 0 2 5
3 0 3 6
And this if X=3 :
a b c x y
1 0 0 0 1 4
2 0 0 0 2 5
3 0 0 0 3 6
What would be an efficient way to do this?
We can assign multiple columns to '0' based on the value of 'X'
X <- 3
nm1 <- names(dt)
dt[letters[seq_len(X)]] <- 0
dt[c(setdiff(names(dt), nm1), nm1)]
Also, we can use add_column from tibble and create columns at a specific location
library(tibble)
add_column(dt, .before = 1, !!!setNames(as.list(rep(0, X)),
letters[seq_len(X)]))
A second option is cbind
f <- function(x, n = 3) {
cbind.data.frame(matrix(
0,
ncol = n,
nrow = nrow(x),
dimnames = list(NULL, letters[1:n])
), x)
}
f(dt, 5)
# a b c d e x y
#1 0 0 0 0 0 1 4
#2 0 0 0 0 0 2 5
#3 0 0 0 0 0 3 6
NOTE: because letters has a length of 26 the function would need some adjustment regarding the naming scheme if n > 26.
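One possible adjustment for n > 26 is to generate numbered names instead of relying on letters. This is a sketch, assuming names such as new1, new2 and so on are acceptable; the function name add_zero_cols and the prefix argument are hypothetical.

```r
# Hypothetical variant: prepend n zero columns with generated names.
add_zero_cols <- function(x, n, prefix = "new") {
  stopifnot(is.data.frame(x), n >= 1)
  zeros <- as.data.frame(matrix(
    0,
    nrow = nrow(x),
    ncol = n,
    dimnames = list(NULL, paste0(prefix, seq_len(n)))
  ))
  cbind(zeros, x)
}

dt  <- data.frame(x = 1:3, y = 4:6)
res <- add_zero_cols(dt, 30)
dim(res)
```

The sketch does not check for clashes between the generated names and existing column names, so a distinct prefix should be chosen where that could happen.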
You can try the code below
dt <- cbind(`colnames<-`(t(rep(0,X)),letters[seq(X)]),dt)
If you don't care about the names of the added columns, you can simply use
dt <- cbind(t(rep(0,X)),dt)
which is much shorter.

removing columns equal to 0 from multiple data frames in a list; lapply not actually removing columns when applying function to a list

I have a list of three data frames that are similar (same number of columns but different number of rows), and were split from a larger data set.
Here is some example code to make three data frames and put them in a list. It is really hard to make an exact replicate of my data since the files are so large (over 400 columns and the first 6 columns are not numerical)
a <- c(0,1,0,1,0,0,0,0,0,1,0,1)
b <- c(0,0,0,0,0,0,0,0,0,0,0,0)
c <- c(1,0,1,1,1,1,1,1,1,1,0,1)
d <- c(0,0,0,0,0,0,0,0,0,0,0,0)
e <- c(1,1,1,1,0,1,0,1,0,1,1,1)
f <- c(0,0,0,0,0,0,0,0,0,0,0,0)
g <- c(1,0,1,0,1,1,1,1,1,1)
h <- c(0,0,0,0,0,0,0,0,0,0)
i <- c(1,0,0,0,0,0,0,0,0,0)
j <- c(0,0,0,0,1,1,1,1,1,0)
k <- c(0,0,0,0,0)
l <- c(1,0,1,0,1)
m <- c(1,0,1,0,0)
n <- c(0,0,0,0,0)
o <- c(1,0,1,0,1)
df1 <- data.frame(a,b,c,d,e,f)
df2 <- data.frame(g,h,i,j)
df3 <- data.frame(k,l,m,n,o)
my.list <- list(df1,df2,df3)
I am looking to remove all the columns in each data frame whose total == 0. The code is below:
list2 <- lapply(my.list, function(x) {x[, colSums(x) != 0];x})
list2 <- lapply(my.list, function(x) {x[, colSums(x != 0) > 0];x})
Both of the above lines run, but neither actually removes the columns that sum to 0.
I am not sure why that is. Any tips are greatly appreciated.
The OP found a solution after exchanging comments with me, but I want to leave the explanation here. In lapply(my.list, function(x) {x[, colSums(x) != 0];x}), the OP was asking R to do two things. The first was to subset each data frame in my.list; the second was to return each data frame. I think the OP assumed that each data frame was updated by the subsetting, but the result of the subsetting was never assigned to anything. A function returns the value of its last expression, so the second expression, x, simply returned the original data frame unchanged, and on the surface nothing happened. Following the OP's approach, I would do something like this.
lapply(my.list, function(x) {foo <- x[, colSums(x) != 0]; foo})
That is, create a temporary object in the anonymous function and return it. Alternatively, return the subset directly, as follows.
lapply(my.list, function(x) x[, colSums(x) != 0])
For each data frame in my.list, run a logical check for each column. If colSums(x) != 0 is TRUE, keep the column. Otherwise remove it. Hope this will help future readers.
[[1]]
a c e
1 0 1 1
2 1 0 1
3 0 1 1
4 1 1 1
5 0 1 0
6 0 1 1
7 0 1 0
8 0 1 1
9 0 1 0
10 1 1 1
11 0 0 1
12 1 1 1
[[2]]
g i j
1 1 1 0
2 0 0 0
3 1 0 0
4 0 0 0
5 1 0 1
6 1 0 1
7 1 0 1
8 1 0 1
9 1 0 1
10 1 0 0
[[3]]
l m o
1 1 1 1
2 0 0 0
3 1 1 1
4 0 0 0
5 1 0 1
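The question mentions that the real data has non-numeric leading columns, on which colSums would fail. A sketch of one way to handle that is below; the helper name drop_zero_cols and the example data are hypothetical, and it assumes non-numeric columns should always be kept.

```r
# Hypothetical helper: drop numeric columns whose total is 0, keep everything else.
drop_zero_cols <- function(d) {
  is_num   <- vapply(d, is.numeric, logical(1))
  zero_sum <- is_num                       # non-numeric columns stay FALSE
  zero_sum[is_num] <- colSums(d[is_num], na.rm = TRUE) == 0
  # drop = FALSE keeps a data frame even if only one column survives
  d[, !zero_sum, drop = FALSE]
}

# Illustrative list with a character id column in each data frame
example_list <- list(
  data.frame(id = c("s1", "s2"), a = c(0, 1), b = c(0, 0)),
  data.frame(id = c("s3", "s4", "s5"), g = c(0, 0, 0), h = c(1, 0, 1))
)

cleaned <- lapply(example_list, drop_zero_cols)
lapply(cleaned, names)
```

Note that a total of 0 is not the same as all values being 0 when negative numbers are present; in that case colSums(d[is_num] != 0) == 0 would be the stricter test.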

delete columns in data frame not in common with another (R)

I have two data frames with different lengths. One is a sample and the other is a test sample.
df1 a b c d ...
1 0 0 0
2 0 0 1
df2 a e b c d ...
1 1 0 0 0
2 0 0 0 1
How can I delete the columns of df2 not in common with df1 ?
As a result I'm looking for df2 with the same columns as df1 (a, b, c, d ...).
I tried merge() but it's not what I'm looking for.
If I understand your question correctly you can subset by the column-names like this:
df2[, colnames(df1)]
If you have column names in df1 not present in df2 you can do
df2[, intersect(colnames(df1), colnames(df2))]
Edit: forgot a comma
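A short worked sketch of the intersect variant, using made-up data shaped like the question's example. The names shared, df2_aligned and missing_in_df2 are illustrative.

```r
df1 <- data.frame(a = 1:2, b = 0, c = 0, d = c(0, 1))
df2 <- data.frame(a = 1:2, e = c(1, 0), b = 0, c = 0, d = c(0, 1))

# Columns present in both, in df1's order
shared <- intersect(colnames(df1), colnames(df2))

# drop = FALSE keeps a data frame even if only one column is shared
df2_aligned <- df2[, shared, drop = FALSE]

# Columns of df1 that df2 lacks (empty here); useful to report before modelling
missing_in_df2 <- setdiff(colnames(df1), colnames(df2))

names(df2_aligned)
```

Plain df2[, colnames(df1)] raises an "undefined columns selected" error if df1 has a column that df2 lacks, which is why the intersect form is the safer default.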
