Generate pairwise data.frame of all combinations of two data.frame with different number of rows - r

I have two data frames, a and b, that I want to combine into a final data frame c:
a <- data.frame(city=c("a","b","c"),detail=c(1,2,3))
b <- data.frame(city=c("x","y"),detail=c(5,6))
The data frame c should look like this:
city.a detail.a city.b detail.b
1 a 1 x 5
2 a 1 y 6
3 b 2 x 5
4 b 2 y 6
5 c 3 x 5
6 c 3 y 6
I think I could use crossing from tidyr but for crossing(a,b) I get:
error: Column names `city`, `detail` must not be duplicated.
Use .name_repair to specify repair.

Yes, crossing is the right function, but as the error message says, column names must not be duplicated. Rename the columns first:
names(a) <- paste0(names(a), ".a")
names(b) <- paste0(names(b), ".b")
tidyr::crossing(a, b)
# city.a detail.a city.b detail.b
# <fct> <dbl> <fct> <dbl>
#1 a 1 x 5
#2 a 1 y 6
#3 b 2 x 5
#4 b 2 y 6
#5 c 3 x 5
#6 c 3 y 6
crossing is a wrapper around expand_grid (it additionally de-duplicates and sorts its inputs), so after correcting the names you can also call expand_grid directly.
tidyr::expand_grid(a, b)

Here is a base R solution by using rep() + cbind(), which gives duplicated column names:
C <- `row.names<-`(cbind(a[rep(seq(nrow(a)),each = nrow(b)),],b),NULL)
such that
> C
city detail city detail
1 a 1 x 5
2 a 1 y 6
3 b 2 x 5
4 b 2 y 6
5 c 3 x 5
6 c 3 y 6
Or get a data frame having different column names by using data.frame():
C <- data.frame(a[rep(seq(nrow(a)),each = nrow(b)),],b,row.names = NULL)
such that
> C
city detail city.1 detail.1
1 a 1 x 5
2 a 1 y 6
3 b 2 x 5
4 b 2 y 6
5 c 3 x 5
6 c 3 y 6
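The two variants above can be combined so that the result carries the column names asked for in the question. This is a minimal base R sketch, not taken from the answers; `idx` and `res` are illustrative names, and the `.a`/`.b` suffixes follow the question's desired output.

```r
a <- data.frame(city = c("a", "b", "c"), detail = c(1, 2, 3))
b <- data.frame(city = c("x", "y"), detail = c(5, 6))

# All pairs of row indices; the first column of expand.grid() varies
# fastest, so rows of `b` cycle within each row of `a`.
idx <- expand.grid(ib = seq_len(nrow(b)), ia = seq_len(nrow(a)))

# Rename each side before binding so that no column name is duplicated.
res <- cbind(
  setNames(a[idx$ia, , drop = FALSE], paste0(names(a), ".a")),
  setNames(b[idx$ib, , drop = FALSE], paste0(names(b), ".b"))
)
rownames(res) <- NULL
res
```

Swapping the order of `ib` and `ia` in `expand.grid()` would make the rows of `a` cycle fastest instead, which is the row order that `merge` produces.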

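With base R, we can use merge. When the two data frames share no column names, merge returns their Cartesian product, so renaming the columns of one side is enough (note that the row order differs from the desired output):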
merge(setNames(a, paste0(names(a), ".a")), b)
# city.a detail.a city detail
#1 a 1 x 5
#2 b 2 x 5
#3 c 3 x 5
#4 a 1 y 6
#5 b 2 y 6
#6 c 3 y 6

Related

Creating an identifier using pairs of row indices [duplicate]

I would like to generate indices to group observations based on two columns, but I want each group to consist of observations that share at least one value in common, directly or indirectly.
In the data below, I want to check if values in 'G1' and 'G2' are connected directly (appear on the same row), or indirectly via other intermediate values. The desired grouping variable is shown in 'g'.
For example, A is directly linked to Z (row 1) and X (row 2). A is indirectly linked to 'B' via X (A -> X -> B), and further linked to Y via X and B (A -> X -> B -> Y).
dt <- data.frame(id = 1:10,
G1 = c("A","A","B","B","C","C","C","D","E","F"),
G2 = c("Z","X","X","Y","W","V","U","s","T","T"),
g = c(1,1,1,1,2,2,2,3,4,4))
dt
# id G1 G2 g
# 1 1 A Z 1
# 2 2 A X 1
# 3 3 B X 1
# 4 4 B Y 1
# 5 5 C W 2
# 6 6 C V 2
# 7 7 C U 2
# 8 8 D s 3
# 9 9 E T 4
# 10 10 F T 4
I tried with group_indices from dplyr, but haven't managed it.
Using igraph, get the component membership, then map it onto the names:
library(igraph)
# df1 is the question's data 'dt' without the desired column 'g'
df1 <- dt[, c("id", "G1", "G2")]
# convert to graph, and get clusters membership ids
g <- graph_from_data_frame(df1[, c(2, 3, 1)])
myGroups <- components(g)$membership
myGroups
# A B C D E F Z X Y W V U s T
# 1 1 2 3 4 4 1 1 1 2 2 2 3 4
# then map on names
df1$group <- myGroups[as.character(df1$G1)]  # index by name, not by factor code
df1
# id G1 G2 group
# 1 1 A Z 1
# 2 2 A X 1
# 3 3 B X 1
# 4 4 B Y 1
# 5 5 C W 2
# 6 6 C V 2
# 7 7 C U 2
# 8 8 D s 3
# 9 9 E T 4
# 10 10 F T 4
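If igraph is not available, the same grouping can be sketched in base R with a small union-find over the values of G1 and G2. This is an illustrative alternative, not the method from the answer above; `component_ids` is a hypothetical helper name, and the loop is written for clarity rather than speed.

```r
dt <- data.frame(id = 1:10,
                 G1 = c("A","A","B","B","C","C","C","D","E","F"),
                 G2 = c("Z","X","X","Y","W","V","U","s","T","T"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: label each row with the connected component
# that its (from, to) pair belongs to.
component_ids <- function(from, to) {
  from <- as.character(from)
  to <- as.character(to)
  nodes <- unique(c(from, to))
  a <- match(from, nodes)
  b <- match(to, nodes)
  parent <- seq_along(nodes)  # every node starts as its own root

  find_root <- function(i) {
    while (parent[i] != i) i <- parent[i]
    i
  }

  # Union step: link the roots of the two endpoints of every row.
  for (k in seq_along(a)) {
    ra <- find_root(a[k])
    rb <- find_root(b[k])
    if (ra != rb) parent[max(ra, rb)] <- min(ra, rb)
  }

  roots <- vapply(a, find_root, integer(1))
  match(roots, unique(roots))  # renumber components as 1, 2, 3, ...
}

dt$group <- component_ids(dt$G1, dt$G2)
dt
```

For the sample data this reproduces the desired column 'g' from the question. For large inputs igraph's `components()` is likely the better choice.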

Compare values in a grouped data frame with corresponding value in a vector

Let's say I got a data.frame like the following:
u <- as.numeric(rep(1:5, 3))
w <- as.factor(c(rep("a",5), rep("b",5), rep("c",5)))
q <- data.frame(w,u)
q
w u
1 a 1
2 a 2
3 a 3
4 a 4
5 a 5
6 b 1
7 b 2
8 b 3
9 b 4
10 b 5
11 c 1
12 c 2
13 c 3
14 c 4
15 c 5
and the vector:
v <- c(2,3,1)
Now I want to find the first row in the respective group [i] where the value in column "u" is bigger than the value [i] from vector "v".
The result should look like this:
1 a 3
2 b 4
3 c 2
I tried:
fun <- function (m) {
first(which(m[,2]>v))
}
ddply(q, .(w), summarise, fun(q))
and got as a result:
w fun(q)
1 a 3
2 b 3
3 c 3
Thus it seems like ddply is only taking the first value from the vector "v".
Does anyone know how to solve this?
We can attach the vector to the data by joining with a data.frame that pairs the unique values of column 'w' with the elements of 'v'. Then group by 'w' and take the first row index where 'u' is greater than the joined value ('new'):
library(dplyr)
q %>%
left_join(data.frame(w = unique(q$w), new = v)) %>%
group_by(w) %>%
summarise(n = which(u > new)[1])
# or use findInterval (requires u to be sorted within each group)
#summarise(n = findInterval(new[1], u)+1)
Output:
# A tibble: 3 x 2
# w n
#* <fct> <int>
#1 a 3
#2 b 4
#3 c 2
Or use Map after splitting the data by the 'w' column:
Map(function(x, y) which(x$u > y)[1], split(q,q$w), v)
#$a
#[1] 3
#$b
#[1] 4
#$c
#[1] 2
The OP mentioned that the comparison starts from the beginning of the data. That is not the case here because of the group_by operation: a row-number column, for example, restarts at 1 within each group:
q %>%
left_join(data.frame(w = unique(q$w), new = v)) %>%
group_by(w) %>%
mutate(rn = row_number())
Joining, by = "w"
# A tibble: 15 x 4
# Groups: w [3]
w u new rn
<fct> <dbl> <dbl> <int>
1 a 1 2 1
2 a 2 2 2
3 a 3 2 3
4 a 4 2 4
5 a 5 2 5
6 b 1 3 1
7 b 2 3 2
8 b 3 3 3
9 b 4 3 4
10 b 5 3 5
11 c 1 1 1
12 c 2 1 2
13 c 3 1 3
14 c 4 1 4
15 c 5 1 5
Using data.table: for each 'w' (by = w), subset 'v' with the group index .GRP. Compare the value with 'u' (v[.GRP] < u). Get the index for the first TRUE (which.max):
library(data.table)
setDT(q)[ , which.max(v[.GRP] < u), by = w]
# w V1
# 1: a 3
# 2: b 4
# 3: c 2
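One caveat is worth keeping in mind when choosing between `which.max(cond)` and `which(cond)[1]`: they differ when no element satisfies the condition. A small base R illustration, with `u` and `threshold` as made-up values:

```r
u <- c(1, 2, 3)
threshold <- 10  # no element of u exceeds this

# which.max() returns the position of the first maximum; when every
# comparison is FALSE, that is position 1, which looks like a match.
idx_which_max <- which.max(threshold < u)

# which() returns an empty vector, so taking [1] yields NA instead.
idx_which <- which(threshold < u)[1]

idx_which_max  # 1
idx_which      # NA
```

For the sample data every group contains a value above its threshold, so both forms agree there.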

cumulative product in R across column

I have a dataframe in the following format
> x <- data.frame("a" = c(1,1),"b" = c(2,2),"c" = c(3,4))
> x
a b c
1 1 2 3
2 1 2 4
I'd like to add 3 new columns holding a cumulative product of columns a, b and c. However, I need a reverse cumulative product, i.e. the output should be
row 1:
result_d = 1*2*3 = 6 , result_e = 2*3 = 6, result_f = 3
and similarly for row 2
The end result will be
a b c result_d result_e result_f
1 1 2 3 6 6 3
2 1 2 4 8 8 4
The column names do not matter; this is just an example. Does anyone have any idea how to do this?
As per my comment, is it possible to do this on a subset of columns, e.g. only for columns b and c, to return:
a b c results_e results_f
1 1 2 3 6 3
2 1 2 4 8 4
so that column "a" is effectively ignored?
One option is to loop through the rows, apply cumprod to the reversed elements, and then reverse the result:
nm1 <- paste0("result_", c("d", "e", "f"))
x[nm1] <- t(apply(x, 1,
function(x) rev(cumprod(rev(x)))))
x
# a b c result_d result_e result_f
#1 1 2 3 6 6 3
#2 1 2 4 8 8 4
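Or a vectorized option is rowCumprods from matrixStats (run it on the original three-column x, since ncol(x) is used to select the columns):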
library(matrixStats)
x[nm1] <- rowCumprods(as.matrix(x[ncol(x):1]))[,ncol(x):1]
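Another base R option is Reduce with accumulate = TRUE, again starting from the original three-column x:
temp <- data.frame(Reduce("*", x[NCOL(x):1], accumulate = TRUE))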
setNames(cbind(x, temp[NCOL(temp):1]),
c(names(x), c("res_d", "res_e", "res_f")))
# a b c res_d res_e res_f
#1 1 2 3 6 6 3
#2 1 2 4 8 8 4
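The follow-up about restricting the calculation to a subset of columns can be handled by selecting those columns before applying the same reverse-cumprod step. A minimal sketch, assuming the columns of interest are named in a vector `cols` (an illustrative name) and the new columns are called `results_e` and `results_f` as in the question:

```r
x <- data.frame(a = c(1, 1), b = c(2, 2), c = c(3, 4))

cols <- c("b", "c")  # columns to include; "a" is ignored

# Row-wise reverse cumulative product over the selected columns only.
rev_cumprod <- t(apply(x[cols], 1, function(r) rev(cumprod(rev(r)))))

x[paste0("results_", c("e", "f"))] <- rev_cumprod
x
```

The number of new column names has to match `length(cols)`, so both should be changed together.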

R: fill a new column in a data frame with a value by matching variables in reverse

I apologize for the title of this question. I can't figure out a good way to briefly describe what I want to do.
I have something like this, with >8000 rows:
x y value_xy
A B 7
A C 2
B A 3
B C 6
C A 2
C B 1
I want to create a new column, value_yx, that looks like this:
x y value_xy value_yx
A B 7 3
A C 2 2
B A 3 7
B C 6 1
C A 2 2
C B 1 6
For each pair (x, y), I want the new column to hold the value of the reversed pair (y, x), i.e. the row where y appears in the x column and x in the y column. Sometimes these values are equal, other times they aren't.
I have explored using for loops, ave(), and several other functions, but I haven't been able to make it work.
Try merge. The by.x and by.y arguments specify columns to be matched, and here the order of matching columns is reversed in by.y:
merge(x = df, y = df, by.x = c("x", "y"), by.y = c("y", "x"))
# x y value_xy.x value_xy.y
# 1 A B 7 3
# 2 A C 2 2
# 3 B A 3 7
# 4 B C 6 1
# 5 C A 2 2
# 6 C B 1 6
Looks like I was beat to it but an alternative solution with mapply
df$value_yx = mapply(function(x_flip, y_flip) df[df$x == y_flip & df$y == x_flip,]$value_xy, df$x, df$y)
# x y value_xy value_yx
#1 A B 7 3
#2 A C 2 2
#3 B A 3 7
#4 B C 6 1
#5 C A 2 2
#6 C B 1 6
xtabs returns a value matrix that can be indexed by a two-column character matrix built from the first two columns in reverse order. Those columns are probably factors, hence the need for the as.character() conversion:
> dfrm$value_yx <- xtabs(value_xy~x+y, dfrm)[
sapply(dfrm[2:1],as.character) ]
> dfrm
x y value_xy value_yx
1 A B 7 3
2 A C 2 2
3 B A 3 7
4 B C 6 1
5 C A 2 2
6 C B 1 6
To see what is being indexed:
> xtabs(value_xy~x+y, dfrm)
y
x A B C
A 0 7 2
B 3 0 6
C 2 1 0
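The reverse lookup can also be sketched with `match` on pasted keys, which avoids building the full cross-table. This is an illustrative base R variant, not one of the answers above; it assumes the separator "|" never occurs in the x or y values, and pairs without a reversed counterpart would get NA.

```r
df <- data.frame(x = c("A", "A", "B", "B", "C", "C"),
                 y = c("B", "C", "A", "C", "A", "B"),
                 value_xy = c(7, 2, 3, 6, 2, 1))

# Key for each row, and the key of the row holding the reversed pair.
key     <- paste(df$x, df$y, sep = "|")
rev_key <- paste(df$y, df$x, sep = "|")

# match() finds, for every row, the position of its reversed pair.
df$value_yx <- df$value_xy[match(rev_key, key)]
df
```

Unlike `merge`, this keeps the original row order.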

Subsetting data frame by another data frame

The data is as follows:
> x
a b
1 1 a
2 2 a
3 3 a
4 1 b
5 2 b
6 3 b
> y
a b
1 2 a
2 3 a
3 3 b
My goal is to compare both data frames and, for each row in x, indicate whether an equivalent row exists in y. All of the y rows are actually contained in x, so I would like to end up with something like this:
> x
a b intersect.x.y
1 1 a F
2 2 a T
3 3 a T
4 1 b F
5 2 b F
6 3 b T
How about this?
x$rn <- 1:nrow(x)
xyrows <- merge(x,y)$rn # maybe you just want to look at the merge ...?
x$iny <- FALSE
x$iny[xyrows] <- TRUE
I suspect there is a more standard approach, but this way is easy to understand.
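One candidate for a more compact approach is to build a key per row and test membership with `%in%`. This is a sketch under the assumption that the separator "|" does not appear in the data; the column name `iny` follows the answer above.

```r
x <- data.frame(a = c(1, 2, 3, 1, 2, 3),
                b = rep(c("a", "b"), each = 3))
y <- data.frame(a = c(2, 3, 3),
                b = c("a", "a", "b"))

# One key per row; a row of x is "in y" if its key occurs among y's keys.
x_key <- paste(x$a, x$b, sep = "|")
y_key <- paste(y$a, y$b, sep = "|")

x$iny <- x_key %in% y_key
x
```

This does not need the helper column `rn`, and it keeps the rows of x in their original order.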
