R merge() rbinds instead of merging

I ran across a behaviour of merge() in R that I can't understand. It seems that it either merges or rbinds data frames depending on whether a column has one or more unique values in it.
a1 <- data.frame (A = c (1, 1))
a2 <- data.frame (A = c (1, 2))
# > merge (a1, a1)
# A
# 1 1
# 2 1
# 3 1
# 4 1
# > merge (a2, a2)
# A
# 1 1
# 2 2
The latter is the result that I would expect, and want, in both cases. I also tried having more than one column, as well as characters instead of numerals, and the results are the same: multiple values result in merging, one unique value results in rbinding.

In the first case each row matches two rows, so the output has 2 x 2 = 4 rows. In the second case each row matches exactly one row, so the output has 2 rows.
To match on row number use this:
merge(a1, a1, by = 0)
## Row.names A.x A.y
## 1 1 1 1
## 2 2 1 1
or match on row number and only return the left instance:
library(sqldf)
sqldf("select x.* from a1 x left join a1 y on x.rowid = y.rowid")
## A
## 1 1
## 2 1
or match on row number and return both instances:
sqldf("select x.A A1, y.A A2 from a1 x left join a1 y on x.rowid = y.rowid")
## A1 A2
## 1 1 1
## 2 1 1
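The row counts explained at the start of this answer can be checked directly. The following is a minimal sketch, not part of the original answer: `expected_rows` is a hypothetical helper that multiplies, for each key value present in both inputs, the number of rows holding that value on each side.

```r
a1 <- data.frame(A = c(1, 1))
a2 <- data.frame(A = c(1, 2))

# Hypothetical helper: predicted number of rows from an inner merge on `key`.
expected_rows <- function(x, y, key) {
  tx <- table(x[[key]])                      # rows per key value in x
  ty <- table(y[[key]])                      # rows per key value in y
  common <- intersect(names(tx), names(ty))  # key values present on both sides
  sum(as.integer(tx[common]) * as.integer(ty[common]))
}

rows_a1 <- expected_rows(a1, a1, "A")  # 2 * 2
rows_a2 <- expected_rows(a2, a2, "A")  # 1 * 1 + 1 * 1

# Compare the prediction with what merge() actually returns
c(predicted = rows_a1, actual = nrow(merge(a1, a1)))
c(predicted = rows_a2, actual = nrow(merge(a2, a2)))
```

If the prediction is larger than the number of rows you expected, the key is not unique in at least one of the inputs.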

The behaviour is described in the documentation (see ?merge). By default, merge() returns a data.frame with columns taken from both original dfs, and it joins every pair of rows that have equal values in all of the columns the two dfs have in common.
df1 <- data.frame(a = 1:3, b = letters[1:3])
df2 <- data.frame(a = 1:5, c = LETTERS[1:5])
df1
a b
1 1 a
2 2 b
3 3 c
df2
a c
1 1 A
2 2 B
3 3 C
4 4 D
5 5 E
merge(df1, df2)
a b c
1 1 a A
2 2 b B
3 3 c C
What's happening in your first example is that merge() combines the rows of your two data frames by the A column, but because both rows in both dfs have the same value, every row in one df matches every row in the other, so it returns all possible combinations.
In your second example, you don't have this problem, so merging is unambiguous. The 1 rows get merged together, as do the 2 rows.
The scenarios are more apparent when you have multiple columns in your dfs:
Case 1:
> df1 <- data.frame(a = c(1, 1), b = letters[1:2])
> df2 <- data.frame(a = c(1, 1), c = LETTERS[1:2])
> df1
a b
1 1 a
2 1 b
> df2
a c
1 1 A
2 1 B
> merge(df1, df2)
a b c
1 1 a A
2 1 a B
3 1 b A
4 1 b B
Case 2:
> df1 <- data.frame(a = c(1, 2), b = letters[1:2])
> df2 <- data.frame(a = c(1, 2), c = LETTERS[1:2])
> df1
a b
1 1 a
2 2 b
> df2
a c
1 1 A
2 2 B
> merge(df1, df2)
a b c
1 1 a A
2 2 b B
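If the intent in Case 1 is to pair the first `a = 1` row with the first, the second with the second, and so on, one option is to make the key unique before merging. This is a sketch under that assumption, not something the answer above proposes; the helper column `occ` (an occurrence counter within each value of `a`) is a hypothetical name.

```r
df1 <- data.frame(a = c(1, 1), b = letters[1:2])
df2 <- data.frame(a = c(1, 1), c = LETTERS[1:2])

# Number the occurrences of each key value: 1, 2, ... within each value of `a`
df1$occ <- ave(seq_along(df1$a), df1$a, FUN = seq_along)
df2$occ <- ave(seq_along(df2$a), df2$a, FUN = seq_along)

# (a, occ) is now unique in both inputs, so each row matches exactly one row
res <- merge(df1, df2, by = c("a", "occ"))
res$occ <- NULL  # drop the helper column again
res
```

This gives two rows instead of four, because the combination of `a` and `occ` no longer repeats on either side.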

Related

Insert and copy row in data frame in R

I have a large data frame that looks like this:
df
X1 X2
1 A B
2 A C
And another that looks like this
df2
Type Group
1 Train A
2 Boat B
3 Car A
4 Hangar C
I want to insert the types from df2 into df, copying the entire row each time I insert, so that I end up with this:
X1 X2 X3
1 A B Train
2 A B Car
3 A B Boat
4 A C Train
5 A C Car
6 A C Hangar
What is the best way to do this in R? I can't figure this out.
I am not sure if I understand your aim correctly, but below is my base R attempt
do.call(
  rbind,
  c(
    make.row.names = FALSE,
    lapply(
      1:nrow(df2),
      function(k) {
        cbind(
          df[which(df == df2$Group[k], arr.ind = TRUE)[, "row"], ],
          X3 = df2$Type[k]
        )
      }
    )
  )
)
which gives
X1 X2 X3
1 A B Train
2 A C Train
3 A B Boat
4 A B Car
5 A C Car
6 A C Hangar
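The same result can also be expressed as a merge. The sketch below is an alternative under my reading of the question (a row of df is copied once for every df2 entry whose Group appears anywhere in that row); the names `long`, `hits`, and `id` are hypothetical.

```r
df  <- data.frame(X1 = c("A", "A"), X2 = c("B", "C"), stringsAsFactors = FALSE)
df2 <- data.frame(Type = c("Train", "Boat", "Car", "Hangar"),
                  Group = c("A", "B", "A", "C"), stringsAsFactors = FALSE)

# One row per cell of df: which row it came from and which group it holds.
# unlist() walks df column by column, so the row ids repeat once per column.
long <- data.frame(id = rep(seq_len(nrow(df)), times = ncol(df)),
                   Group = unlist(df, use.names = FALSE),
                   stringsAsFactors = FALSE)

# Attach every Type whose Group occurs in that row, then restore row order
hits <- merge(long, df2, by = "Group")
hits <- hits[order(hits$id), ]

# Copy the original row once per hit and add the Type as X3
res <- data.frame(df[hits$id, ], X3 = hits$Type, stringsAsFactors = FALSE)
rownames(res) <- NULL
res
```

Note that a row containing the same group value in two columns would be matched twice by this approach, so deduplicate `long` first if that can happen in your data.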

Calculate Euclidean distances between elements of two data sets

I have two data sets of the same size:
>df1
c d e
a 2 3 4
b 5 1 3
>df2
h i j
f 1 1 2
g 0 4 3
I need to calculate Euclidean distances between the corresponding elements of these data sets to get:
c d e
a 1 2 2
b 5 3 0
I have tried using dist(rbind(df1, df2)), but the result gave only one entry.
I have to perform this operation with numerous data sets, that's why your help will be really appreciated.
The following will work if the data frames are entirely numeric and have the same number of rows and columns.
df3 <- abs(df1 - df2)
df3
# c d e
# a 1 2 2
# b 5 3 0
DATA
df1 <- read.table(text = " c d e
a 2 3 4
b 5 4 3",
header = TRUE, stringsAsFactors = FALSE, row.names = 1)
df2 <- read.table(text = " h i j
f 1 1 2
g 0 1 3",
header = TRUE, stringsAsFactors = FALSE, row.names = 1)
Given your update, the solution is to take the absolute value (abs) of the difference:
abs(df1 - df2)
And you could make a function if you want to repeat the process a lot:
myfunc1 <- function(x1, x2) {
  abs(x1 - x2)
}
myfunc1(df1, df2)
The output looks as intended:
[,1] [,2] [,3]
[1,] 1 2 2
[2,] 5 3 0
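For single numbers the Euclidean distance reduces to the absolute difference, which is why abs() is enough here. Since the question mentions numerous data sets, here is a minimal sketch of applying the same idea to paired lists; `abs_diff`, `xs`, and `ys` are hypothetical names, and the data are the values shown in the question.

```r
df1 <- data.frame(c = c(2, 5), d = c(3, 1), e = c(4, 3), row.names = c("a", "b"))
df2 <- data.frame(h = c(1, 0), i = c(1, 4), j = c(2, 3), row.names = c("f", "g"))

# Element-wise absolute difference; fails early if the shapes differ
abs_diff <- function(x, y) {
  stopifnot(identical(dim(x), dim(y)))
  abs(as.matrix(x) - as.matrix(y))
}

# Pair up the data sets position by position and process each pair
xs <- list(df1, df1)
ys <- list(df2, df2)
results <- Map(abs_diff, xs, ys)
results[[1]]
```

Converting to matrices sidesteps the differing column names; the result keeps the row and column names of the first argument.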

Combining information from two data frames in R

This is my sample data
> data.frame
a b c d
W_1_N NA NA NA NA
W_1_E 2 2 2 4
W_1_C 4 2 2 4
W_1_D NA NA NA NA
First I had to find, for each row of the matrix, the pairs of column names where one element is 4 and the other is 2.
The result looks like this:
W_1_E.1 d a
W_1_E.2 d b
W_1_E.3 d c
W_1_C.1 a b
W_1_C.2 a c
W_1_C.3 d b
W_1_C.4 d c
I wanted only pairs where one element is 4 and the other is 2 in the same row. W_1_N and W_1_D contain only NA, so they were omitted. W_1_E appears in 3 rows because there are 3 (4, 2) pairs in that row of the sample data. W_1_C has 4 pairs.
This is the code:
library(dplyr)
library(tidyr)
library(tibble)
lst <- data.frame(df) %>%
  rownames_to_column("rn") %>%
  drop_na() %>%
  gather(key, value, -rn) %>%
  group_by(rn, value) %>%
  summarise(l = list(unique(key))) %>%
  split(.$rn)
pair <- do.call("rbind", lapply(lst, function(x) expand.grid(x$l[[1]], x$l[[2]])))
It works perfectly, but now I have a second data.frame:
a b c d
W_1_N 0 1 1 1
W_1_E 1 1 0 0
W_1_C 1 1 1 0
W_1_D 1 0 1 1
Here is my problem: I want to keep only those pairs where both elements of the pair have the value 1 in the second data.frame. For example, the first pair of my result, W_1_E.1 d a, should be eliminated because d has the value 0 in the W_1_E row of the second data.frame.
The output should be:
W_1_C.1 a b
W_1_C.2 a c
d has the value 0 in the W_1_E row, so all W_1_E rows in my result data.frame were eliminated (all of those pairs contained d). The last two rows were eliminated because d is also 0 in the W_1_C row of the second data.frame.
Thanks for your help
How's this?
x <- "N a b c d
W_1_N NA NA NA NA
W_1_E 2 2 2 4
W_1_C 4 2 2 4
W_1_D NA NA NA NA "
x1 <- read.table(text = x, header = TRUE)
x <- "N a b c d
W_1_N 0 1 1 1
W_1_E 1 1 0 0
W_1_C 1 1 1 0
W_1_D 1 0 1 1 "
x2 <- read.table(text = x, header = TRUE)
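# Join the two tables on the row label; x1 columns get suffix .x, x2 columns .y
df <- merge(x1, x2, by = "N")
# Blank out every value whose flag in the second table is 0
df$a <- ifelse(df$a.y == 0, NA, df$a.x)
df$b <- ifelse(df$b.y == 0, NA, df$b.x)
df$c <- ifelse(df$c.y == 0, NA, df$c.x)
df$d <- ifelse(df$d.y == 0, NA, df$d.x)
# Keep only the label and the four masked columns
df <- df[, c(1, 10:13)]
library(tidyr)
# Reshape to long format: one row per (label, column name, value)
df_all <- df %>%
  gather(key = key1, value, 2:5)
# Column names that hold a 2 and column names that hold a 4
df2 <- df_all[!is.na(df_all$value) & df_all$value == 2, ]
df4 <- df_all[!is.na(df_all$value) & df_all$value == 4, ]
# Pair every 2-column with every 4-column that shares the same label
merge(df2[, 1:2], df4[, 1:2], by = "N", all.x = FALSE, all.y = FALSE)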
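A more compact way to express the same idea is to mask the first table with the second and then build the pairs row by row. This is only a sketch: it stores the sample data as matrices rather than reading them with read.table, and the names `masked`, `per_row`, and `pair_tbl` are hypothetical.

```r
cols <- c("a", "b", "c", "d")
x1 <- rbind(W_1_N = c(NA, NA, NA, NA),
            W_1_E = c(2, 2, 2, 4),
            W_1_C = c(4, 2, 2, 4),
            W_1_D = c(NA, NA, NA, NA))
x2 <- rbind(W_1_N = c(0, 1, 1, 1),
            W_1_E = c(1, 1, 0, 0),
            W_1_C = c(1, 1, 1, 0),
            W_1_D = c(1, 0, 1, 1))
colnames(x1) <- cols
colnames(x2) <- cols

# Hide every value whose flag in the second table is 0
masked <- x1
masked[x2 == 0] <- NA

# For each row, combine every column holding a 4 with every column holding a 2
per_row <- lapply(rownames(masked), function(r) {
  fours <- colnames(masked)[which(masked[r, ] == 4)]
  twos  <- colnames(masked)[which(masked[r, ] == 2)]
  if (length(fours) == 0 || length(twos) == 0) return(NULL)
  data.frame(row = r,
             expand.grid(four = fours, two = twos, stringsAsFactors = FALSE),
             stringsAsFactors = FALSE)
})
pair_tbl <- do.call(rbind, Filter(Negate(is.null), per_row))
pair_tbl
```

With the sample data only W_1_C survives, paired as (a, b) and (a, c), which matches the output requested in the question.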

Removing duplicate rows in dataframes with multiple columns

In a dataframe like this
a <- c(rep("A", 3), rep("B", 3), rep("C",2))
b <- c(1,1,2,4,1,1,2,2)
df <-data.frame(a,b)
it is possible to remove duplicates using (based on the results of b column) :
df[!duplicated(df), ]
If I add a third column c to the df and again want to remove duplicates based on the values of column b, is it right to use this?
df[!duplicated(df$b), ]
The dataframe:
a <- c(rep("A", 3), rep("B", 3), rep("C",2))
b <- c(1,1,2,4,1,1,2,2)
c <- c("i","i","ii","ii","iii","iii","iv","iv")
df <-data.frame(a,b,c)
Removing duplicates based on column b:
df[!duplicated(df$b), ]
the result is this:
a b c
A 1 i
A 2 ii
B 4 ii
and I would expect this
a b c
A 1 i
A 2 ii
B 4 ii
B 1 iii
C 2 iv
Input:
a <- c(rep("A", 3), rep("B", 3), rep("C",2))
b <- c(1,1,2,4,1,1,2,2)
c <- c("i","i","ii","ii","iii","iii","iv","iv")
df <-data.frame(a,b,c)
Expected output, as described in the post:
a b c
A 1 i
A 2 ii
B 4 ii
B 1 iii
C 2 iv
Using distinct on all columns seems to do what you want:
>library(dplyr)
>distinct(df)
a b c
1 A 1 i
2 A 2 ii
3 B 4 ii
4 B 1 iii
5 C 2 iv
Other variation: only allow unique b's:
> distinct(df,b, .keep_all = TRUE)
a b c
1 A 1 i
2 A 2 ii
3 B 4 ii
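For comparison, the same distinctions can be drawn in base R with duplicated(). A minimal sketch using the data from the question; the result names `all_cols`, `only_b`, and `a_and_b` are hypothetical.

```r
df <- data.frame(a = c(rep("A", 3), rep("B", 3), rep("C", 2)),
                 b = c(1, 1, 2, 4, 1, 1, 2, 2),
                 c = c("i", "i", "ii", "ii", "iii", "iii", "iv", "iv"))

# A row is dropped only if ALL of its columns repeat an earlier row
all_cols <- df[!duplicated(df), ]

# A row is dropped as soon as its b value was seen before, whatever a and c hold
only_b <- df[!duplicated(df$b), ]

# A row is dropped if the combination of a and b was seen before
a_and_b <- df[!duplicated(df[c("a", "b")]), ]

nrow(all_cols)
nrow(only_b)
nrow(a_and_b)
```

The first and third variants keep five rows with this data, while deduplicating on b alone keeps three, which is the difference between the result the question obtained and the one it expected.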

R: Reshape count matrix to long format with multiple entries

I have a matrix. The entries of the matrix are counts for the combination of the dimension levels. For example:
(m0 <- matrix(1:4, nrow=2, dimnames=list(c("A","B"),c("A","B"))))
A B
A 1 3
B 2 4
I can change it to a long format:
library("reshape")
(m1 <- melt(m0))
X1 X2 value
1 A A 1
2 B A 2
3 A B 3
4 B B 4
But I would like to have multiple entries according to value:
m2 <- m1
for (i in 1:nrow(m1)) {
  j <- m1[i, "value"]
  k <- 2
  while (k <= j) {
    m2 <- rbind(m2, m1[i, ])
    k <- k + 1
  }
}
> m2 <- subset(m2,select = - value)
> m2[order(m2$X1),]
X1 X2
1 A A
3 A B
31 A B
32 A B
2 B A
4 B B
21 B A
41 B B
42 B B
43 B B
Is there a parameter in melt that repeats the entries according to value? Or is there another library that can do this?
We could do this with base R. We convert the dimnames of 'm0' to a 'data.frame' with two columns using expand.grid, then replicate the rows of the dataset with the values in 'm0', order the rows and change the row names to NULL (if necessary).
d1 <- expand.grid(dimnames(m0))
d2 <- d1[rep(1:nrow(d1), c(m0)),]
res <- d2[order(d2$Var1),]
row.names(res) <- NULL
res
# Var1 Var2
#1 A A
#2 A B
#3 A B
#4 A B
#5 B A
#6 B A
#7 B B
#8 B B
#9 B B
#10 B B
Or with melt, we convert the 'm0' to 'long' format and then replicate the rows as before.
library(reshape2)
dM <- melt(m0)
dM[rep(1:nrow(dM), dM$value),1:2]
As @Frank mentioned, we can also use table with as.data.frame to create 'dM'
dM <- as.data.frame(as.table(m0))
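Putting the as.table variant together with the row replication used earlier gives a version that needs no extra packages. A minimal sketch; note that as.data.frame on a table names the count column Freq rather than value.

```r
m0 <- matrix(1:4, nrow = 2, dimnames = list(c("A", "B"), c("A", "B")))

# Long format: columns Var1, Var2 and the counts in Freq
dM <- as.data.frame(as.table(m0))

# Repeat each row index as many times as its count, keep only the two labels
res <- dM[rep(seq_len(nrow(dM)), dM$Freq), c("Var1", "Var2")]
rownames(res) <- NULL
res

# Cross-tabulating the long result recovers the original counts
table(res$Var1, res$Var2)
```

The final table() call is a quick sanity check that the expansion produced exactly one row per counted observation.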
