Insert and copy row in data frame in R

I have a large data frame that looks like this:
df
X1 X2
1 A B
2 A C
And another that looks like this:
df2
Type Group
1 Train A
2 Boat B
3 Car A
4 Hangar C
I want to insert df2 into df, copying the entire row of df for every matching entry I insert, so that I end up with this:
X1 X2 X3
1 A B Train
2 A B Car
3 A B Boat
4 A C Train
5 A C Car
6 A C Hangar
What is the best way to do this in R? I can't figure this out.

I am not sure if I understand your aim correctly, but below is my base R attempt
do.call(
  rbind,
  c(
    make.row.names = FALSE,
    lapply(
      1:nrow(df2),
      function(k) {
        cbind(
          df[which(df == df2$Group[k], arr.ind = TRUE)[, "row"], ],
          X3 = df2$Type[k]
        )
      }
    )
  )
)
which gives
X1 X2 X3
1 A B Train
2 A C Train
3 A B Boat
4 A B Car
5 A C Car
6 A C Hangar
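If the row order requested in the question matters (all matches for the first row of df, then all matches for the second), one possible variant is to loop over the rows of df instead of the rows of df2. This is only a sketch, assuming character columns and that a Type should be attached whenever its Group appears in X1 or X2; the names groups, hits and res are made up for illustration.

```r
df  <- data.frame(X1 = c("A", "A"), X2 = c("B", "C"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Type  = c("Train", "Boat", "Car", "Hangar"),
                  Group = c("A", "B", "A", "C"),
                  stringsAsFactors = FALSE)

res <- do.call(rbind, lapply(seq_len(nrow(df)), function(i) {
  # Groups mentioned in this row of df, in column order (X1, then X2)
  groups <- as.character(unlist(df[i, ], use.names = FALSE))
  # Types belonging to each of those groups, keeping the group order
  hits <- unlist(lapply(groups, function(g) df2$Type[df2$Group == g]))
  # Repeat the df row once per matching Type and attach the Type as X3
  cbind(df[rep(i, length(hits)), , drop = FALSE], X3 = hits)
}))
row.names(res) <- NULL
res
#   X1 X2     X3
# 1  A  B  Train
# 2  A  B    Car
# 3  A  B   Boat
# 4  A  C  Train
# 5  A  C    Car
# 6  A  C Hangar
```

The rows are the same as in the answer above; only their order differs.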

Related

R merge() rbinds instead of merging

I ran across a behaviour of merge() in R that I can't understand. It seems that it either merges or rbinds data frames depending on whether a column has one or more unique values in it.
a1 <- data.frame (A = c (1, 1))
a2 <- data.frame (A = c (1, 2))
# > merge (a1, a1)
# A
# 1 1
# 2 1
# 3 1
# 4 1
# > merge (a2, a2)
# A
# 1 1
# 2 2
The latter is the result that I would expect, and want, in both cases. I also tried having more than one column, as well as characters instead of numerals, and the results are the same: multiple values result in merging, one unique value results in rbinding.
In the first case each row matches two rows so there are 2x2=4 rows in the output and in the second case each row matches one row so there are 2 rows in the output.
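The row counts can be checked with a small sketch. The helper expected_rows below is hypothetical (it is not part of base R); it simply multiplies, for each shared key value, the number of occurrences on each side, which is the row count an inner merge on a single key column should produce.

```r
a1 <- data.frame(A = c(1, 1))
a2 <- data.frame(A = c(1, 2))

# Hypothetical helper: predicted number of rows of merge(x, y, by = key)
expected_rows <- function(x, y, key) {
  keys <- intersect(x[[key]], y[[key]])
  sum(vapply(keys,
             function(k) sum(x[[key]] == k) * sum(y[[key]] == k),
             numeric(1)))
}

expected_rows(a1, a1, "A")  # 2 * 2 = 4
nrow(merge(a1, a1))         # 4
expected_rows(a2, a2, "A")  # 1 * 1 + 1 * 1 = 2
nrow(merge(a2, a2))         # 2
```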
To match on row number use this:
merge(a1, a1, by = 0)
## Row.names A.x A.y
## 1 1 1 1
## 2 2 1 1
or match on row number and only return the left instance:
library(sqldf)
sqldf("select x.* from a1 x left join a1 y on x.rowid = y.rowid")
## A
## 1 1
## 2 1
or match on row number and return both instances:
sqldf("select x.A A1, y.A A2 from a1 x left join a1 y on x.rowid = y.rowid")
## A1 A2
## 1 1 1
## 2 1 1
The behaviour is detailed in the documentation but, basically, merge() will, by default, want to give you a data.frame with columns taken from both original dfs. It is going to merge rows of the two by unique values of all common columns.
df1 <- data.frame(a = 1:3, b = letters[1:3])
df2 <- data.frame(a = 1:5, c = LETTERS[1:5])
df1
a b
1 1 a
2 2 b
3 3 c
df2
a c
1 1 A
2 2 B
3 3 C
4 4 D
5 5 E
merge(df1, df2)
a b c
1 1 a A
2 2 b B
3 3 c C
What's happening in your first example is that merge() wants to combine the rows of your two data frames by the A column but because both rows in both dfs are the same, it can't figure out which row to merge with which so it creates all possible combinations.
In your second example, you don't have this problem and so merging is unambiguous. The 1 rows will get merged together as will the 2 rows.
The scenarios are more apparent when you have multiple columns in your dfs:
Case 1:
> df1 <- data.frame(a = c(1, 1), b = letters[1:2])
> df2 <- data.frame(a = c(1, 1), c = LETTERS[1:2])
> df1
a b
1 1 a
2 1 b
> df2
a c
1 1 A
2 1 B
> merge(df1, df2)
a b c
1 1 a A
2 1 a B
3 1 b A
4 1 b B
Case 2:
> df1 <- data.frame(a = c(1, 2), b = letters[1:2])
> df2 <- data.frame(a = c(1, 2), c = LETTERS[1:2])
> df1
a b
1 1 a
2 2 b
> df2
a c
1 1 A
2 2 B
> merge(df1, df2)
a b c
1 1 a A
2 2 b B

Calculate Euclidean distances between elements of two data sets

I have two data sets of the same size:
>df1
c d e
a 2 3 4
b 5 1 3
>df2
h i j
f 1 1 2
g 0 4 3
I need to calculate Euclidean distances between the corresponding elements of these data sets to get:
c d e
a 1 2 2
b 5 3 0
I have tried using dist(rbind(df1, df2)), but the result gave only one entry.
I have to perform this operation on numerous data sets, so any help would be appreciated.
The following will work if the data frames are all numeric and have the same number of rows and columns.
df3 <- abs(df1 - df2)
df3
# c d e
# a 1 2 2
# b 5 3 0
DATA
df1 <- read.table(text = " c d e
a 2 3 4
b 5 4 3",
header = TRUE, stringsAsFactors = FALSE, row.names = 1)
df2 <- read.table(text = " h i j
f 1 1 2
g 0 1 3",
header = TRUE, stringsAsFactors = FALSE, row.names = 1)
Given your update, the solution would be to take the absolute value (abs) of the difference:
abs(df1 - df2)
And you could make a function if you want to repeat the process a lot:
myfunc1 <- function(x1, x2) {
  abs(x1 - x2)
}
myfunc1(df1, df2)
The output looks as intended:
[,1] [,2] [,3]
[1,] 1 2 2
[2,] 5 3 0
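Since the question mentions numerous data sets, here is a rough sketch of applying the same idea pairwise with Map. It assumes the data sets are held in two lists of equally sized numeric data frames; the list names firsts and seconds and the doubled copies are made up for illustration. It also checks that, element by element, the Euclidean distance sqrt((x - y)^2) is the same thing as abs(x - y).

```r
# Data as given in the question
df1 <- data.frame(c = c(2, 5), d = c(3, 1), e = c(4, 3),
                  row.names = c("a", "b"))
df2 <- data.frame(h = c(1, 0), i = c(1, 4), j = c(2, 3),
                  row.names = c("f", "g"))

# For single elements the Euclidean distance reduces to the absolute difference
stopifnot(isTRUE(all.equal(as.matrix(sqrt((df1 - df2)^2)),
                           as.matrix(abs(df1 - df2)))))

# Hypothetical collections of data sets, paired by position
firsts  <- list(df1, df1 * 2)
seconds <- list(df2, df2 * 2)

# One distance table per pair; row and column names come from the first argument
dists <- Map(function(x, y) abs(x - y), firsts, seconds)
dists[[1]]
#   c d e
# a 1 2 2
# b 5 3 0
```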

R: Reshape count matrix to long format with multiple entries

I have a matrix. The entries of the matrix are counts for the combination of the dimension levels. For example:
(m0 <- matrix(1:4, nrow=2, dimnames=list(c("A","B"),c("A","B"))))
A B
A 1 3
B 2 4
I can change it to a long format:
library("reshape")
(m1 <- melt(m0))
X1 X2 value
1 A A 1
2 B A 2
3 A B 3
4 B B 4
But I would like to have multiple entries according to value:
m2 <- m1
for (i in 1:nrow(m1)) {
  j <- m1[i, "value"]
  k <- 2
  while (k <= j) {
    m2 <- rbind(m2, m1[i, ])
    k <- k + 1
  }
}
m2 <- subset(m2, select = -value)
m2[order(m2$X1), ]
X1 X2
1 A A
3 A B
31 A B
32 A B
2 B A
4 B B
21 B A
41 B B
42 B B
43 B B
Is there a parameter in melt that repeats the entries according to value? Or is there another library that can do this?
We could do this with base R. We convert the dimnames of 'm0' to a 'data.frame' with two columns using expand.grid, then replicate the rows of the dataset with the values in 'm0', order the rows and change the row names to NULL (if necessary).
d1 <- expand.grid(dimnames(m0))
d2 <- d1[rep(1:nrow(d1), c(m0)),]
res <- d2[order(d2$Var1),]
row.names(res) <- NULL
res
# Var1 Var2
#1 A A
#2 A B
#3 A B
#4 A B
#5 B A
#6 B A
#7 B B
#8 B B
#9 B B
#10 B B
Or with melt, we convert the 'm0' to 'long' format and then replicate the rows as before.
library(reshape2)
dM <- melt(m0)
dM[rep(1:nrow(dM), dM$value),1:2]
As @Frank mentioned, we can also use table with as.data.frame to create 'dM' (the count column is then named Freq rather than value):
dM <- as.data.frame(as.table(m0))
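For completeness, a minimal sketch of the as.table route from start to finish. The only assumption beyond the code above is the object name long; the column names Var1, Var2 and Freq are what as.data.frame produces for a table with unnamed dimnames.

```r
m0 <- matrix(1:4, nrow = 2, dimnames = list(c("A", "B"), c("A", "B")))

# Long format with columns Var1, Var2 and Freq
dM <- as.data.frame(as.table(m0))

# Repeat each row Freq times and keep only the two label columns
long <- dM[rep(seq_len(nrow(dM)), dM$Freq), c("Var1", "Var2")]
row.names(long) <- NULL

# Sanity check: tabulating the long data gives back the original counts
table(long$Var1, long$Var2)
```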

Create a table with defined levels in R

I have a vector of actual values (factor variables) a a a b b c c b c b c c ...
I have a vector of predict values a b b a ...
I want to create a confusion matrix t = table (actual, predict)
If I print t, it looks like:
a b c
a 4 3 1
b 1 5 2
c 3 1 8
However, I want to print it as
b c a
b 8 2 3
c 2 3 4
a 1 2 3
(i.e., I want to change the order of the rows and columns, but keep it as a confusion matrix)
How could I do that in R?
We could change/convert the columns to factor with levels specified.
actual <- factor(actual, levels=c('b', 'c', 'a'))
predict <- factor( predict, levels = c('b', 'c', 'a'))
table(actual, predict)
# predict
#actual b c a
# b 0 1 4
# c 3 2 2
# a 2 3 3
Or we can use row/column indexing
table(actual, predict)[c('b','c','a'), c('b', 'c', 'a')]
data
set.seed(24)
actual <- sample(letters[1:3], 20, replace=TRUE)
predict <- sample(letters[1:3], 20, replace=TRUE)
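To see what the indexing approach does to a concrete table, here is a small sketch that rebuilds the first table from the question by hand and reorders it. The object names t0, ord and t1 are made up for illustration. Reordering only permutes rows and columns, so every count stays attached to the same (actual, predict) pair.

```r
# The first table from the question, entered column by column
t0 <- as.table(matrix(c(4, 1, 3,
                        3, 5, 1,
                        1, 2, 8),
                      nrow = 3,
                      dimnames = list(actual  = c("a", "b", "c"),
                                      predict = c("a", "b", "c"))))

ord <- c("b", "c", "a")
t1 <- t0[ord, ord]   # same counts, rows and columns in the order b, c, a
t1

# The diagonal (correct predictions) is unchanged by the reordering
sum(diag(t0)) == sum(diag(t1))
```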

Order data frame by columns with different calling schemes

Say I have the following data frame:
df <- data.frame(x1 = c(2, 2, 2, 1),
x2 = c(3, 3, 2, 1),
let = c("B", "A", "A", "A"))
df
x1 x2 let
1 2 3 B
2 2 3 A
3 2 2 A
4 1 1 A
If I want to order df by x1, then x2 then let, I do this:
df2 <- df[with(df, order(x1, x2, let)), ]
df2
x1 x2 let
4 1 1 A
3 2 2 A
2 2 3 A
1 2 3 B
However, x1 and x2 have actually been saved as an id <- c("x1", "x2") vector earlier in the code, which I use for other purposes.
So my problem is that I want to reference id instead of x1 and x2 in my order function, but unfortunately anything like df[order(df[id], df$let), ] will result in an "argument lengths differ" error.
From what I can tell (and this has been addressed at another SO thread), the problem is that length(df[id]) == 2 and length(df$let) == 4.
I have been able to make it through this workaround:
df3 <- df[order(df[, id[1]], df[, id[2]], df[, "let"]), ]
df3
x1 x2 let
4 1 1 A
3 2 2 A
2 2 3 A
1 2 3 B
But it looks ugly and depends on knowing the size of id.
Is there a more elegant solution to sorting my data frame by id then let?
I would suggest using do.call(order, ...) and combining id and "let" with c():
id <- c("x1", "x2")
df[do.call(order, df[c(id, "let")]), ]
# x1 x2 let
# 4 1 1 A
# 3 2 2 A
# 2 2 3 A
# 1 2 3 B
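This works because a data frame is a list of columns, so do.call hands each column to order() as a separate argument. A small sketch to confirm that it matches the explicit call is below. One caveat, stated as an assumption about your data: the column names are passed along as argument names, so a column called, say, decreasing or method would be taken as an option of order(); dropping the names first avoids that. The names keys and idx are made up for illustration.

```r
df <- data.frame(x1  = c(2, 2, 2, 1),
                 x2  = c(3, 3, 2, 1),
                 let = c("B", "A", "A", "A"))
id <- c("x1", "x2")

# Unnamed list of sort keys, so no column name can clash with an order() argument
keys <- unname(as.list(df[c(id, "let")]))
idx  <- do.call(order, keys)

# Same permutation as spelling the columns out by hand
identical(idx, with(df, order(x1, x2, let)))
df[idx, ]
```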
