How to merge two dataframes and keep only different columns (content)?

I have two data frames with the same number of rows but different numbers of columns. The column names also differ, but some columns may have identical content.
For example, df1:
df1<- data.frame("a"=c("0","1","0","1","0","0","0"),
"b"=c("1","1","1","1","1","0","0"),
"c"=c("1","1","0","0","1","0","0"),
"d"=c("1","1","1","1","1","1","1"))
df2:
df2<- data.frame("e"=c("1","1","0","1","0","0","0"),
"f"=c("1","1","1","1","1","0","0"),
"g"=c("0","0","0","0","1","0","0"),
"h"=c("0","0","0","0","1","1","1"))
Note that column "b" of df1 and column "f" of df2 are equal. The result I want is therefore a new data frame that looks like this:
df3 <- data.frame("a"=c("0","1","0","1","0","0","0"),
"c"=c("1","1","0","0","1","0","0"),
"d"=c("1","1","1","1","1","1","1"),
"e"=c("1","1","0","1","0","0","0"),
"g"=c("0","0","0","0","1","0","0"),
"h"=c("0","0","0","0","1","1","1"))
NOTE: columns "b" and "f" (which were identical) are not in the new df3.
I have searched the web but did not find an example of this. I think the main difficulty is that the merge is by content, not by column name.

This would do the job:
df3 <- cbind(df1,df2)
df3 <- t(t(df3)[!(duplicated(t(df3)) | duplicated(t(df3), fromLast = TRUE)),])
df3
# a c d e g h
#1 0 1 1 1 0 0
#2 1 1 1 1 0 0
#3 0 0 1 0 0 0
#4 1 0 1 1 0 0
#5 0 1 1 0 1 1
#6 0 0 1 0 0 1
#7 0 0 1 0 0 1
This gives you a matrix; you can convert the result back to a data frame with as.data.frame() if desired.
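If the result should stay a data frame, the same "drop every column that has a duplicate" idea can be applied to the list of columns rather than to a transposed matrix. The following is a minimal sketch under the assumption that any column whose content occurs more than once should go, including duplicates within a single data frame; the names `combined`, `cols` and `dup` are only illustrative.

```r
df1 <- data.frame(a = c("0","1","0","1","0","0","0"),
                  b = c("1","1","1","1","1","0","0"),
                  c = c("1","1","0","0","1","0","0"),
                  d = c("1","1","1","1","1","1","1"))
df2 <- data.frame(e = c("1","1","0","1","0","0","0"),
                  f = c("1","1","1","1","1","0","0"),
                  g = c("0","0","0","0","1","0","0"),
                  h = c("0","0","0","0","1","1","1"))

combined <- cbind(df1, df2)

# duplicated() on a list compares the elements (here: whole columns)
cols <- as.list(combined)
dup <- duplicated(cols) | duplicated(cols, fromLast = TRUE)

# Keep only columns whose content is unique; the result is still a data frame
df3 <- combined[!dup]
names(df3)
```

Because no transposition is involved, the column types are preserved instead of being coerced to a common matrix type.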

We can use sapply to check for the columns that perfectly match.
mat <- sapply(df1, function(x) sapply(df2, function(y) all(x == y)))
mat
# a b c d
#e FALSE FALSE FALSE FALSE
#f FALSE TRUE FALSE FALSE
#g FALSE FALSE FALSE FALSE
#h FALSE FALSE FALSE FALSE
Here we can see that column b from df1 and column f from df2 should be removed. We can do this with:
m2 <- which(mat, arr.ind = TRUE)
cbind(df1[-m2[, 2]], df2[-m2[, 1]])
# a c d e g h
#1 0 1 1 1 0 0
#2 1 1 1 1 0 0
#3 0 0 1 0 0 0
#4 1 0 1 1 0 0
#5 0 1 1 0 1 1
#6 0 0 1 0 0 1
#7 0 0 1 0 0 1
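One caveat with negative indexing: if no columns match, `m2` has zero rows, and `df1[-m2[, 2]]` then selects zero columns instead of all of them. A logical index avoids this. The sketch below wraps the idea in a hypothetical helper, `drop_shared()` (not part of any package), and assumes both data frames have at least one column.

```r
# Hypothetical helper: drop columns of x and y that are identical across the two
drop_shared <- function(x, y) {
  # rows correspond to columns of y, columns to columns of x
  mat <- sapply(x, function(cx) sapply(y, function(cy) identical(cx, cy)))
  # force a matrix even when y has a single column
  mat <- matrix(mat, nrow = length(y), ncol = length(x))
  keep_x <- !colSums(mat)   # TRUE where a column of x matches nothing in y
  keep_y <- !rowSums(mat)   # TRUE where a column of y matches nothing in x
  cbind(x[keep_x], y[keep_y])
}

x  <- data.frame(a = c(1, 2), b = c(3, 4))
y  <- data.frame(c = c(5, 6), d = c(3, 4))   # d equals b
y2 <- data.frame(c = c(5, 6), d = c(7, 8))   # nothing matches

names(drop_shared(x, y))    # b and d are dropped
names(drop_shared(x, y2))   # all four columns survive
```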

Here is a more tidyverse solution.
library(dplyr)
library(tidyr)
library(tibble) # provides rownames_to_column()
# based on Ronak's sapply approach
matches <- as.data.frame(sapply(df1, function(x) sapply(df2, function(y) identical(x, y)))) %>%
rownames_to_column(var = "df2") %>%
pivot_longer(-df2, names_to = "df1") %>% # pivot longer
filter(value) # keep only the matches
# programmatically build list of names to remove
vars_remove <- c(matches$df1, matches$df2) # will remove var names that are matches
df1 %>% bind_cols(df2) %>%
select(-any_of(vars_remove))
# a c d e g h
#1 0 1 1 1 0 0
#2 1 1 1 1 0 0
#3 0 0 1 0 0 0
#4 1 0 1 1 0 0
#5 0 1 1 0 1 1
#6 0 0 1 0 0 1
#7 0 0 1 0 0 1

We can use outer from base R
mat <- outer(df1, df2, FUN = Vectorize(function(x, y) all(x == y)))
mat
# e f g h
#a FALSE FALSE FALSE FALSE
#b FALSE TRUE FALSE FALSE
#c FALSE FALSE FALSE FALSE
#d FALSE FALSE FALSE FALSE
Now, we can get the row/column names
m2 <- as.matrix(subset(as.data.frame.table(mat), Freq, select = -Freq))
Now we use 'm2' to remove the matching columns from 'df1' and 'df2', and cbind the rest
cbind(df1[setdiff(names(df1), m2[,1])], df2[setdiff(names(df2), m2[,2])])
# a c d e g h
#1 0 1 1 1 0 0
#2 1 1 1 1 0 0
#3 0 0 1 0 0 0
#4 1 0 1 1 0 0
#5 0 1 1 0 1 1
#6 0 0 1 0 0 1
#7 0 0 1 0 0 1

Related

find how many times and which columns a string is repeated

My data consists of 6 strings per element, and each string has 6 characters. The data contains white space too.
I want to know how many times each string is repeated across all columns.
For example, P67809 is repeated 2 times, in column a and column d,
so the output should look like
string No columns
P67809 2 a,d
Based on this function I can assign a row number to each string
normalize <- function(x, delim) {
x <- gsub(")", "", x, fixed=TRUE)
x <- gsub("(", "", x, fixed=TRUE)
idx <- rep(seq_len(length(x)), times=nchar(gsub(sprintf("[^%s]",delim), "", as.character(x)))+1)
names <- unlist(strsplit(as.character(x), delim))
return(setNames(idx, names))
}
Then I apply the function to every column, like this:
myS <- lapply(mydata, normalize,";")
but I don't know how to search the result and produce that output.

We could melt the data from 'wide' to 'long' format. Split the 'value' column with ; to get a list output. We set the names of the list as the 'variable' column of 'dM'. Then stack the list to a two column output, and get the frequency count with 'tbl'. It may be easier to understand the result from the 'tbl' output.
library(reshape2)
dM <- melt(mydata, id.var=NULL)
lst1 <- setNames(strsplit(dM$value, ";"), dM$variable)
tbl <- table(stack(lst1)[2:1])
tbl
values
#ind A4QPH2 O60814 P0CG47 P0CG48 P14923 P15924 P19338 P35908 P42356 P57053 P58876 P62750 P62807 P62851 P62979 P63241 P67809 Q02413 Q06830 Q07955 Q16658 Q5QNW6 Q6IS14 Q8N8J0 Q93079 Q969S3
# a 0 1 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0
# b 3 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 1 0 0 0
# c 1 0 0 1 1 0 0 0 0 0 0 0 0 0 1 1 1 0 1 0 0 1 0 1 0 0 0
# d 0 0 1 1 1 0 0 0 0 0 1 1 1 1 0 1 0 1 0 1 1 0 1 0 0 1 1
# values
#ind Q99877 Q99879 Q9Y2T7
# a 0 0 1
# b 0 0 0
# c 0 0 0
# d 1 1 1
We get the total number of each element with colSums.
cS <- colSums(tbl)
If we need the output in the format of the OP's post, we can melt the list to create a two-column data.frame (columns 'value' and 'L1', where 'L1' holds the original column name). We convert it to a 'data.table' with setDT(), group by the 'value' column, count the unique elements of 'L1' with uniqueN(), and paste the unique elements together with toString().
library(data.table)
res <- setDT(melt(lst1))[, list(No= uniqueN(L1),
columns= toString(unique(L1))) ,.(string=value)]
head(res,2)
# string No columns
#1: P67809 2 a, d
#2: Q9Y2T7 2 a, d
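The same string/No/columns summary can also be sketched in base R alone. Since `mydata` is not shown in the question, the data below is a small made-up stand-in, and the intermediate names (`per_col`, `long`, `res`) are illustrative. Each string is counted once per column, so "No" is the number of columns it appears in.

```r
# Made-up stand-in for the question's data
mydata <- data.frame(
  a = c("P67809;Q9Y2T7", "P42356"),
  d = c("P67809", "Q9Y2T7;P58876"),
  stringsAsFactors = FALSE
)

# Split every column on ";", trim white space, drop empty strings,
# and keep each string once per column
per_col <- lapply(mydata, function(col) {
  ids <- trimws(unlist(strsplit(col, ";", fixed = TRUE)))
  unique(ids[nzchar(ids)])
})

# Long format: one row per (string, column) pair
long <- stack(per_col)

# Summarise per string; both tapply() calls return groups in the same order
cols <- tapply(as.character(long$ind), long$values, toString)
n    <- tapply(as.character(long$ind), long$values, length)

res <- data.frame(string = names(cols),
                  No = as.vector(n),
                  columns = as.vector(cols),
                  stringsAsFactors = FALSE)
res
```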

One approach might be:
res <- apply(mydata, 2, function(x) unlist(strsplit(x, ";")))
un <- unique(unlist(res))
res2 <- sapply(un, function(x) lapply(res, function(y) as.numeric(x %in% y)))
res2
P67809 Q9Y2T7 P42356 Q8N8J0 A4QPH2 P35908 P19338 P15924 P14923 Q02413 P63241 Q6IS14
a 1 1 1 1 1 1 1 1 1 0 0 0
b 0 0 0 0 0 0 0 0 0 1 1 1
c 0 0 0 0 0 0 0 0 0 1 1 1
d 1 1 0 0 0 0 0 0 0 0 0 0
P62979 P0CG47 P0CG48 Q16658 P62851 Q07955 Q06830 P62807 O60814 P57053 Q99879 Q99877
a 0 0 0 0 0 0 0 0 0 0 0 0 0
b 1 1 1 1 0 0 0 0 0 0 0 0 0
c 1 1 1 1 1 1 0 0 0 0 0 0 0
d 1 1 1 0 0 0 1 1 1 1 1 1 1
Q93079 Q5QNW6 P58876 P62750 Q969S3
a 0 0 0 0 0
b 0 0 0 0 0
c 0 0 0 0 0
d 1 1 1 1 1
as.data.frame(t(apply(t(res2), 1, function(x) cbind(sum(as.numeric(x)), paste(names(x)[which(as.logical(x))], collapse = ",")))))
V1 V2
P67809 2 a,d
Q9Y2T7 2 a,d
P42356 1 a
Q8N8J0 1 a
A4QPH2 1 a
P35908 1 a
P19338 1 a
P15924 1 a
P14923 1 a
Q02413 2 b,c
P63241 2 b,c
Q6IS14 2 b,c
P62979 3 b,c,d
P0CG47 3 b,c,d
P0CG48 3 b,c,d
2 b,c
Q16658 1 c
P62851 1 c
Q07955 1 d
Q06830 1 d
P62807 1 d
O60814 1 d
P57053 1 d
Q99879 1 d
Q99877 1 d
Q93079 1 d
Q5QNW6 1 d
P58876 1 d
P62750 1 d
Q969S3 1 d

An alternative approach with cSplit from splitstackshape and gather from tidyr.
library(splitstackshape)
library(tidyr)
library(dplyr)
splitted <- cSplit(mydata, splitCols = names(mydata), sep = ";") %>% gather() # Split cols and melt data
splitted$key <- substring(splitted$key, 1, 1) # Lose irrelevant string
table(splitted) # Generate frequency table

Populating data from one data.table to another

I have a distance matrix (as data.table) showing pairwise distances between a number of items, but not all items are in the matrix. I need to create a larger data.table that has all the missing items populated. I can do this with matrices fairly easily:
items=c("a", "b", "c", "d")
small_matrix=matrix(c(0, 1, 2, 3), nrow=2, ncol=2,
dimnames=list(c("a", "b"), c("a", "b")))
# create zero matrix of the right size
full_matrix <- matrix(0, ncol=length(items), nrow=length(items),
dimnames=list(items, items))
# populate items from the small matrix
full_matrix[rownames(small_matrix), colnames(small_matrix)] <- small_matrix
full_matrix
# a b c d
# a 0 2 0 0
# b 1 3 0 0
# c 0 0 0 0
# d 0 0 0 0
What is the equivalent of that in data.table? I can create an 'id' column in small_DT and use it as the key, but I'm not sure how to overwrite the items in full_DT that have the same id/column pair.

Let's convert to data.table and keep the row names as an extra column:
dts = as.data.table(small_matrix, keep.rownames = TRUE)
# rn a b
#1: a 0 2
#2: b 1 3
dtf = as.data.table(full_matrix, keep.rownames = TRUE)
# rn a b c d
#1: a 0 0 0 0
#2: b 0 0 0 0
#3: c 0 0 0 0
#4: d 0 0 0 0
Now just join on the rows, and assuming small matrix is always a subset you can do the following:
dtf[dts, names(dts) := dts, on = 'rn']
dtf
# rn a b c d
#1: a 0 2 0 0
#2: b 1 3 0 0
#3: c 0 0 0 0
#4: d 0 0 0 0
The above assumes data.table version 1.9.5 or later, which supports the on= argument. With older versions you'll need to set the key first.
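A variant of the same update join names the value columns explicitly and reads them from the joined table through the `i.` prefix, which leaves the join column itself untouched. This is a sketch that assumes the data.table package is installed and that every row name of the small matrix also occurs in the full one; `cols` is an illustrative name.

```r
library(data.table)

items <- c("a", "b", "c", "d")
small_matrix <- matrix(c(0, 1, 2, 3), nrow = 2, ncol = 2,
                       dimnames = list(c("a", "b"), c("a", "b")))
full_matrix <- matrix(0, nrow = length(items), ncol = length(items),
                      dimnames = list(items, items))

dts <- as.data.table(small_matrix, keep.rownames = TRUE)
dtf <- as.data.table(full_matrix, keep.rownames = TRUE)

# Columns to overwrite: everything in dts except the join column
cols <- setdiff(names(dts), "rn")

# Update join: for rows of dtf matching dts on 'rn', take the values from dts
dtf[dts, on = "rn", (cols) := mget(paste0("i.", cols))]
dtf
```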

Suppose you have these two data.table:
dt1 = as.data.table(small_matrix)
# a b
#1: 0 2
#2: 1 3
dt2 = as.data.table(full_matrix)
# a b c d
#1: 0 0 0 0
#2: 0 0 0 0
#3: 0 0 0 0
#4: 0 0 0 0
You can't assign the way you would with a data.frame or matrix, e.g. by doing:
dt2[rownames(full_matrix) %in% rownames(small_matrix), names(dt1), with=F] <- dt1
This code raises an error; to assign new values in a data.table, you need to use the := operator:
dt2[rownames(full_matrix) %in% rownames(small_matrix), names(dt1):=dt1][]
# a b c d
#1: 0 2 0 0
#2: 1 3 0 0
#3: 0 0 0 0
#4: 0 0 0 0

Create counter of consecutive runs of a certain value

I have data where consecutive runs of zero are separated by runs of non-zero values. I want to create a counter for the runs of zero in the column 'SOG'.
For the first sequence of 0 in SOG, set the counter in column Stops to 1. For the second run of zeros, set 'Stops' to 2, and so on.
SOG Stops
--- -----
4 0
4 0
0 1
0 1
0 1
3 0
4 0
5 0
0 2
0 2
1 0
2 0
0 3
0 3
0 3

SOG <- c(4,4,0,0,0,3,4,5,0,0,1,2,0,0,0)
#run length encoding:
tmp <- rle(SOG)
#turn values into logicals
tmp$values <- tmp$values == 0
#cumulative sum of TRUE values
tmp$values[tmp$values] <- cumsum(tmp$values[tmp$values])
#inverse the run length encoding
inverse.rle(tmp)
#[1] 0 0 1 1 1 0 0 0 2 2 0 0 3 3 3

Try
df$stops<- with(df, cumsum(c(0, diff(!SOG))>0)*!SOG)
df$stops
# [1] 0 0 1 1 1 0 0 0 2 2 0 0 3 3 3

Using dplyr:
library(dplyr)
df <- df %>% mutate(Stops = ifelse(SOG == 0, yes = cumsum(c(0, diff(!SOG) > 0)), no = 0))
df$Stops
#[1] 0 1 1 1 0 0 0 2 2 0 0 3 3 3
EDIT: As an aside to those of us who are still beginners, many of the answers to this question make use of logicals (i.e. TRUE, FALSE). ! before a numeric variable like SOG tests whether the value is 0 and assigns TRUE if it is, and FALSE otherwise.
SOG
#[1] 4 0 0 0 3 4 5 0 0 1 2 0 0 0
!SOG
#[1] FALSE TRUE TRUE TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE
#[12] TRUE TRUE TRUE
diff() takes the difference between each value and the one before it. Note that the result has one element fewer than SOG, since the first element has no predecessor from which to compute a difference. For logicals, diff(!SOG) gives 1 where a FALSE is followed by a TRUE (TRUE - FALSE = 1), -1 where a TRUE is followed by a FALSE (FALSE - TRUE = -1), and 0 otherwise.
diff(SOG)
#[1] -4 0 0 3 1 1 -5 0 1 1 -2 0 0
diff(!SOG)
#[1] 1 0 0 -1 0 0 1 0 -1 0 1 0 0
So cumsum(diff(!SOG) > 0) just focuses on the TRUE - FALSE changes
cumsum(diff(!SOG) > 0)
#[1] 1 1 1 1 1 1 2 2 2 2 3 3 3
But since the vector of differences is one element shorter, we can prepend an element:
cumsum(c(0, diff(!SOG) > 0)) #Or cumsum( c(0, diff(!SOG)) > 0 )
#[1] 0 1 1 1 1 1 1 2 2 2 2 3 3 3
Then either multiply that vector by !SOG as in @akrun's answer or use ifelse(). If a particular element of SOG is 0, we use the corresponding element of cumsum(c(0, diff(!SOG) > 0)); if it isn't 0, we assign 0.
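One edge case is worth checking: the diff-based counter only registers a run when a non-zero value precedes it, so a run of zeros at the very start of the vector is not counted. The sketch below uses a made-up four-element vector and seeds the run starts with the first element instead of a constant 0; `is_zero`, `naive`, `starts` and `fixed` are illustrative names. The rle-based answers are not affected by this.

```r
SOG <- c(0, 0, 3, 0)   # made-up data that begins with a run of zeros
is_zero <- SOG == 0

# Counter as built above: a run starting at position 1 is missed
naive <- cumsum(c(0, diff(is_zero)) > 0) * is_zero

# Seed with the first element so a leading run counts as run 1
starts <- c(is_zero[1], diff(is_zero) > 0)
fixed <- cumsum(starts) * is_zero

naive
fixed
```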

A one-liner with rle would be -
df <- data.frame(SOG = c(4,4,0,0,0,3,4,5,0,0,1,2,0,0,0))
df <- transform(df, Stops = with(rle(SOG == 0), rep(cumsum(values) * values, lengths)))
df
# SOG Stops
#1 4 0
#2 4 0
#3 0 1
#4 0 1
#5 0 1
#6 3 0
#7 4 0
#8 5 0
#9 0 2
#10 0 2
#11 1 0
#12 2 0
#13 0 3
#14 0 3
#15 0 3

Transforming dataframe into expanded matrix in r

Say I have the following dataframe:
dfx <- data.frame(Var1=c("A", "B", "C", "D", "B", "C", "D", "C", "D", "D"),
Var2=c("E", "E", "E", "E", "A", "A", "A", "B", "B", "C"),
Var1out = c(1,-1,-1,-1,1,-1,-1,1,-1,-1),
Var2out= c(-1,1,1,1,-1,1,1,-1,1,1))
dfx
Var1 Var2 Var1out Var2out
1 A E 1 -1
2 B E -1 1
3 C E -1 1
4 D E -1 1
5 B A 1 -1
6 C A -1 1
7 D A -1 1
8 C B 1 -1
9 D B -1 1
10 D C -1 1
What you see here are 10 rows that correspond to match-ups between players A, B, C, D and E. They play each other once and the winner of each match-up is denoted by a +1 and the loser of each match-up is denoted by a -1 (put into the respective column Player Var1 result in Var1out, Player Var2 result in Var2out).
Desired output.
I wish to transform this dataframe into this output matrix (the order of rows is not important to me, but as you can see each row refers to a unique match-up):
A B C D E
1 1 0 0 0 -1
2 0 -1 0 0 1
3 0 0 -1 0 1
4 0 0 0 -1 1
5 -1 1 0 0 0
6 1 0 -1 0 0
7 1 0 0 -1 0
8 0 -1 1 0 0
9 0 1 0 -1 0
10 0 0 1 -1 0
What I've done:
I managed to make this matrix in a roundabout way. As roundabout ways tend to be slow and less satisfactory, I was wondering if anyone can spot a better way.
I first made sure that my two columns containing players had factor levels that contained every possible player that ever occurs (you'll note for instance that player E never occurs in Var1).
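# Making sure Var1 and Var2 have same factor levels
# (as.character() also works when the columns are plain character vectors, the default since R 4.0)
levs <- unique(c(as.character(dfx$Var1), as.character(dfx$Var2))) #get all possible levels of factors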
dfx$Var1 <- factor(dfx$Var1, levels=levs)
dfx$Var2 <- factor(dfx$Var2, levels=levs)
I next split the dataframe into two - one for Var1 and Var1out, and one for Var2 and Var2out:
library(dplyr)
temp.Var1 <- dfx %>% select(Var1, Var1out)
temp.Var2 <- dfx %>% select(Var2, Var2out)
Here I use model.matrix to expand columns by factor level:
mat.Var1<-with(temp.Var1, data.frame(model.matrix(~Var1+0)))
mat.Var2<-with(temp.Var2, data.frame(model.matrix(~Var2+0)))
For each row, I then replace the '1' that marks the player's column with that player's result, and add the two matrices:
mat1 <- apply(mat.Var1, 2, function(x) ifelse(x == 1, temp.Var1$Var1out, 0))
mat2 <- apply(mat.Var2, 2, function(x) ifelse(x == 1, temp.Var2$Var2out, 0))
matX <- mat1+mat2
matX
Var1A Var1B Var1C Var1D Var1E
1 1 0 0 0 -1
2 0 -1 0 0 1
3 0 0 -1 0 1
4 0 0 0 -1 1
5 -1 1 0 0 0
6 1 0 -1 0 0
7 1 0 0 -1 0
8 0 -1 1 0 0
9 0 1 0 -1 0
10 0 0 1 -1 0
Although this works, I have a sense that I am probably missing simpler solutions for this problem. Thanks.

Create an empty matrix and use matrix indexing to fill the relevant values in:
cols <- unique(unlist(dfx[1:2]))
M <- matrix(0, nrow = nrow(dfx), ncol = length(cols), dimnames = list(NULL, cols))
M[cbind(sequence(nrow(dfx)), match(dfx$Var1, cols))] <- dfx$Var1out
M[cbind(sequence(nrow(dfx)), match(dfx$Var2, cols))] <- dfx$Var2out
M
# A B C D E
# [1,] 1 0 0 0 -1
# [2,] 0 -1 0 0 1
# [3,] 0 0 -1 0 1
# [4,] 0 0 0 -1 1
# [5,] -1 1 0 0 0
# [6,] 1 0 -1 0 0
# [7,] 1 0 0 -1 0
# [8,] 0 -1 1 0 0
# [9,] 0 1 0 -1 0
# [10,] 0 0 1 -1 0
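The key step is indexing a matrix with a two-column matrix: each row of the index is read as one (row, column) pair, so a whole set of scattered cells can be assigned in one call. A toy illustration with made-up values:

```r
M <- matrix(0, nrow = 3, ncol = 3, dimnames = list(NULL, c("A", "B", "C")))

# Each row of idx addresses exactly one cell: (1,2), (2,1), (3,3)
idx <- cbind(row = 1:3, col = c(2, 1, 3))
M[idx] <- c(10, 20, 30)
M
```

In the answer above, the row part of the index is the match-up number and the column part comes from `match()`, which translates each player name into its column position.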

Another way is to use acast
library(reshape2)
#added `use.names=FALSE` from @Ananda Mahto's comments
dfy <- data.frame(Var=unlist(dfx[,1:2], use.names=FALSE),
VarOut=unlist(dfx[,3:4], use.names=FALSE), indx=1:nrow(dfx))
acast(dfy, indx~Var, value.var="VarOut", fill=0)
# A B C D E
#1 1 0 0 0 -1
#2 0 -1 0 0 1
#3 0 0 -1 0 1
#4 0 0 0 -1 1
#5 -1 1 0 0 0
#6 1 0 -1 0 0
#7 1 0 0 -1 0
#8 0 -1 1 0 0
#9 0 1 0 -1 0
#10 0 0 1 -1 0
Or use spread
library(tidyr)
spread(dfy,Var, VarOut , fill=0)[,-1]
# A B C D E
#1 1 0 0 0 -1
#2 0 -1 0 0 1
#3 0 0 -1 0 1
#4 0 0 0 -1 1
#5 -1 1 0 0 0
#6 1 0 -1 0 0
#7 1 0 0 -1 0
#8 0 -1 1 0 0
#9 0 1 0 -1 0
#10 0 0 1 -1 0
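For completeness, the same long-to-wide step can be sketched with `xtabs()` from base R, which sums the value column for every index/player combination and fills the empty cells with 0. This assumes each player appears at most once per match-up, as in the example data; `tab` and `wide` are illustrative names.

```r
dfx <- data.frame(Var1 = c("A", "B", "C", "D", "B", "C", "D", "C", "D", "D"),
                  Var2 = c("E", "E", "E", "E", "A", "A", "A", "B", "B", "C"),
                  Var1out = c(1, -1, -1, -1, 1, -1, -1, 1, -1, -1),
                  Var2out = c(-1, 1, 1, 1, -1, 1, 1, -1, 1, 1))

# Long format: one row per (match-up, player) pair
dfy <- data.frame(Var = unlist(dfx[1:2], use.names = FALSE),
                  VarOut = unlist(dfx[3:4], use.names = FALSE),
                  indx = rep(seq_len(nrow(dfx)), 2))

# Cross-tabulate: rows are match-ups, columns are players
tab <- xtabs(VarOut ~ indx + Var, data = dfy)

# Convert the table to an ordinary data frame if needed
wide <- as.data.frame.matrix(tab)
wide
```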

Vectorizing a for-loop that merges two data frames by column

Suppose I have two dataframes df1 and df2.
df1 <- data.frame(matrix(c(0,0,1,0,0,1,1,1,0,1),ncol=10,nrow=1))
colnames(df1) <- LETTERS[seq(1,10)]
df2 <- data.frame(matrix(c(1,1,1,1),ncol=4,nrow=1))
colnames(df2) <- c("C","D","A","I")
Some of the column names in df2 match column names in df1 and df1 always contains every possible column name that can occur in df2. I want to append df1 with a new row which holds the value of df2 for matching columns and a 0 for non-matching columns. My current approach uses a for-loop:
for(i in 1:ncol(df1)){
if(colnames(df1)[i] %in% colnames(df2)){
df1[2,i] <- df2[1,which(colnames(df2)==colnames(df1)[i])]
} else {
df1[2,i] <- 0
}
}
Well, it works. But I wonder if there is a cleaner (and faster) solution for this task, perhaps taking advantage of vectorized operations.

res <-merge(df1,df2,all=T)[,colnames(df1)]
res[is.na(res)] <- 0
res
# A B C D E F G H I J
# 1 0 0 1 0 0 1 1 1 0 1
# 2 1 0 1 1 0 0 0 0 1 0

Possibly more efficient would be bind_rows from "dplyr" (older dplyr versions offered rbind_list and rbind_all for this; both have since been removed):
library(dplyr)
bind_rows(df1, df2)
# A B C D E F G H I J
# 1 0 0 1 0 0 1 1 1 0 1
# 2 1 NA 1 1 NA NA NA NA 1 NA
Assign to "res" and replace NA with "0" in the same way shown by @akrun.

Just using assignment:
df1[2,] <- 0
df1[2,names(df2)] <- df2
# A B C D E F G H I J
#1 0 0 1 0 0 1 1 1 0 1
#2 1 0 1 1 0 0 0 0 1 0
...and just to prove it works with other values:
df2$C <- 8
df1[2,] <- 0
df1[2,names(df2)] <- df2
# A B C D E F G H I J
#1 0 0 1 0 0 1 1 1 0 1
#2 1 0 8 1 0 0 0 0 1 0
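The lookup that the for-loop performs one column at a time can also be written as a single `match()` call, which makes the name-to-position mapping explicit. This is a sketch with made-up values in `df2` (5, 6, 7, 8) so that the mapping is visible in the result; `pos` and `new_row` are illustrative names.

```r
df1 <- data.frame(matrix(c(0, 0, 1, 0, 0, 1, 1, 1, 0, 1), nrow = 1))
colnames(df1) <- LETTERS[1:10]
df2 <- data.frame(matrix(c(5, 6, 7, 8), nrow = 1))
colnames(df2) <- c("C", "D", "A", "I")

# Position of each df1 column within df2 (NA where df2 lacks the column)
pos <- match(names(df1), names(df2))

# Take the df2 value where a match exists, 0 otherwise
new_row <- ifelse(is.na(pos), 0, unlist(df2[1, ])[pos])

# Append as a new row
df1[nrow(df1) + 1, ] <- new_row
df1
```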
