Combination of all pairs of rows using R

Here is my dataset:
data <- read.table(header = TRUE, text = "
group index group_index x y z
a 1 a1 12 13 14
a 2 a2 15 20 22
b 1 b1 24 17 28
b 2 b2 12 19 30
b 3 b3 31 32 33 ")
For each case in group "a" and each case in group "b", I want to combine their x, y, z values in a row, so the data matrix or dataframe I want will look like:
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] a1_b1 12 13 14 24 17 28 # x,y,z for a1, followed by x,y,z for b1
[2,] a1_b2 12 13 14 12 19 30 # x,y,z for a1, followed by x,y,z for b2
[3,] a1_b3 12 13 14 31 32 33
[4,] a2_b1 15 20 22 24 17 28 # x,y,z for a2, followed by x,y,z for b1
[5,] a2_b2 15 20 22 12 19 30
[6,] a2_b3 15 20 22 31 32 33
I'm wondering how to achieve this. Thanks so much!

We can split the data based on group and take a Cartesian product using merge:
list_df <- split(data[c("x", "y", "z")], data$group)
out <- merge(list_df[[1]], list_df[[2]], by = NULL)
out[do.call(order, out), ]
# x.x y.x z.x x.y y.y z.y
#3 12 13 14 12 19 30
#1 12 13 14 24 17 28
#5 12 13 14 31 32 33
#4 15 20 22 12 19 30
#2 15 20 22 24 17 28
#6 15 20 22 31 32 33
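If the a1_b1 style label from the question is also wanted, one option (a sketch building on the split/merge idea above, not part of the original answer) is to keep group_index in the split and paste the two suffixed columns together:
list_df <- split(data[c("group_index", "x", "y", "z")], data$group)
out <- merge(list_df[[1]], list_df[[2]], by = NULL)
out$pair <- paste(out$group_index.x, out$group_index.y, sep = "_")
out[order(out$pair), c("pair", "x.x", "y.x", "z.x", "x.y", "y.y", "z.y")]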

You could also do a join on non-matching group values (using < instead of != to avoid repeating pairs):
library(data.table)
setDT(data)
data[data, on = .(group < group),
     .(g = paste0(group_index, '_', i.group_index),
       x, y, z, i.x, i.y, i.z),
     nomatch = NULL]
# g x y z i.x i.y i.z
# 1: a1_b1 12 13 14 24 17 28
# 2: a2_b1 15 20 22 24 17 28
# 3: a1_b2 12 13 14 12 19 30
# 4: a2_b2 15 20 22 12 19 30
# 5: a1_b3 12 13 14 31 32 33
# 6: a2_b3 15 20 22 31 32 33

A simple solution using dplyr:
library(tidyverse)
dcross <- left_join(data, data, by = character(), suffix = c("1", "2")) |>
  filter(group1 < group2)   # < keeps each a/b pair once, in the order shown below
# index1 group_index1 x1 y1 index2 group_index2 x2 y2
# 1 1 a1 12 13 1 b1 24 17
# 2 1 a1 12 13 2 b2 12 19
# 3 1 a1 12 13 3 b3 31 32
# 4 2 a2 15 20 1 b1 24 17
# 5 2 a2 15 20 2 b2 12 19
# 6 2 a2 15 20 3 b3 31 32
And to get the described matrix from the data frame:
dcross |>
  select(matches("^[xyz]\\d")) |>
  as.matrix()
# x1 y1 z1 x2 y2 z2
# [1,] 12 13 14 24 17 28
# [2,] 12 13 14 12 19 30
# [3,] 12 13 14 31 32 33
# [4,] 15 20 22 24 17 28
# [5,] 15 20 22 12 19 30
# [6,] 15 20 22 31 32 33
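As an aside, newer dplyr releases supersede the by = character() cross join; with dplyr 1.1 or later the same pairing could be written with cross_join() (a sketch, not part of the original answer):
dcross <- cross_join(data, data, suffix = c("1", "2")) |>
  filter(group1 < group2)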

Related

Transpose from long to wide with pair groups in R

I have descriptive statistics for four groups. My sample dataset is:
df <- data.frame(
Grade = c(3,3,3,3,4,4,4,4),
group = c("none","G1","G2","both","none","G1","G2","both"),
mean=c(10,12,13,12,11,18,19,20),
sd=c(22,12,22,12,11,13,14,15),
N=c(35,33,34,32,43,45,46,47))
> df
Grade group mean sd N
1 3 none 10 22 35
2 3 G1 12 12 33
3 3 G2 13 22 34
4 3 both 12 12 32
5 4 none 11 11 43
6 4 G1 18 13 45
7 4 G2 19 14 46
8 4 both 20 15 47
I would like to compare the groups as pairs and need the descriptive information side by side for each pair. Here is what I would like to have: one row per pair of groups within each grade (so each grade has 6 pairs of groups).
Does anyone have any idea on this?
Thanks!
1) sqldf We can join df to itself on the indicated condition. Note that group is escaped since it is an SQL keyword.
library(sqldf)
sqldf('select
         a.Grade,
         a.[group] Group1, b.[group] Group2,
         a.mean mean1, b.mean mean2,
         a.sd sd1, b.sd sd2,
         a.N n1, b.N n2
       from df a
       join df b on a.Grade = b.Grade and a.[group] > b.[group]')
giving:
Grade Group1 Group2 mean1 mean2 sd1 sd2 n1 n2
1 3 none G1 10 12 22 12 35 33
2 3 none G2 10 13 22 22 35 34
3 3 none both 10 12 22 12 35 32
4 3 G2 G1 13 12 22 12 34 33
5 3 both G1 12 12 12 12 32 33
6 3 both G2 12 13 12 22 32 34
7 4 none G1 11 18 11 13 43 45
8 4 none G2 11 19 11 14 43 46
9 4 none both 11 20 11 15 43 47
10 4 G2 G1 19 18 14 13 46 45
11 4 both G1 20 18 15 13 47 45
12 4 both G2 20 19 15 14 47 46
2) base R We can perform a merge on part of the condition and then subset it for the remainder. The names are slightly different so you will need to change them if that is important.
subset(merge(df, df, by = "Grade"), group.x > group.y)
giving:
Grade group.x mean.x sd.x N.x group.y mean.y sd.y N.y
2 3 none 10 22 35 G1 12 12 33
3 3 none 10 22 35 G2 13 22 34
4 3 none 10 22 35 both 12 12 32
8 3 G1 12 12 33 both 12 12 32
10 3 G2 13 22 34 G1 12 12 33
12 3 G2 13 22 34 both 12 12 32
18 4 none 11 11 43 G1 18 13 45
19 4 none 11 11 43 G2 19 14 46
20 4 none 11 11 43 both 20 15 47
24 4 G1 18 13 45 both 20 15 47
26 4 G2 19 14 46 G1 18 13 45
28 4 G2 19 14 46 both 20 15 47
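If matching the sqldf column names matters, one possible follow-up (a sketch, not part of the original answer) is to rename the merged columns afterwards:
res <- subset(merge(df, df, by = "Grade"), group.x > group.y)
names(res) <- c("Grade", "Group1", "mean1", "sd1", "n1", "Group2", "mean2", "sd2", "n2")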

Can't correctly read the value type of data frame elements

I have a data frame SSIM_BEST:
X1 X2 X3 X4 X5
1 1 36 0.939323 B4 ON
2 1 35 0.943645 B2 ON
3 1 34 0.948516 B2 ON
4 1 33 0.952599 ZL ON
5 1 32 0.956492 ZL ON
6 1 31 0.960432 ZL ON
7 1 30 0.963957 ZL ON
8 1 29 0.96664 ZL ON
9 1 28 0.969612 ZL ON
10 1 27 0.97234 ZL ON
11 1 26 0.97478 ZL ON
12 1 25 0.977332 ZL ON
13 1 24 0.979606 ZL ON
14 1 23 0.981423 ZL ON
15 1 22 0.983776 ZL ON
I have a for loop to read some values from the X3 column, like:
SSIM <- c()
for (j in seq(1, dim(SSIM_BEST)[1], by = 2)) {
  SSIM <- c(SSIM, SSIM_BEST$X3[[j]])
}
Instead of getting values like 0.939323, 0.948516, ... I get SSIM = 20 27 33 39 44 52 56 61 and I don't know what is going on.
If I use print(SSIM_BEST$X3[[j]]) inside the for loop, I get something like:
[1] 0.939323
72 Levels: 0.894559 0.899583 0.901154 0.907706 0.914609 0.914673 0.91996 0.920569 0.922076 0.925761 0.925897 0.926495 0.928728 0.931108 ... 0.992964
P.S. SSIM_BEST contains more than 15 rows. I show 15 here for example purposes.
Can you help me please?
We can create a TRUE/FALSE vector to subset with.
# data
SSIM_BEST <- read.table(text ="
X1 X2 X3 X4 X5
1 1 36 0.939323 B4 ON
2 1 35 0.943645 B2 ON
3 1 34 0.948516 B2 ON
4 1 33 0.952599 ZL ON
5 1 32 0.956492 ZL ON
6 1 31 0.960432 ZL ON
7 1 30 0.963957 ZL ON
8 1 29 0.96664 ZL ON
9 1 28 0.969612 ZL ON
10 1 27 0.97234 ZL ON
11 1 26 0.97478 ZL ON
12 1 25 0.977332 ZL ON
13 1 24 0.979606 ZL ON
14 1 23 0.981423 ZL ON
15 1 22 0.983776 ZL ON", header = TRUE)
# get odd rows
SSIM_BEST[c(TRUE, FALSE), "X3"]
# more generic solution
mySkip = 2
SSIM_BEST[seq(nrow(SSIM_BEST)) %% mySkip == 1, "X3"]
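Note that this assumes X3 is numeric. If it has been read in as a factor (which the 72 Levels output in the question suggests), convert it via character first; a direct as.numeric() on a factor returns the internal integer codes, which is exactly the 20 27 33 ... symptom described above. A minimal sketch:
# only needed if X3 is a factor; harmless to check first
if (is.factor(SSIM_BEST$X3)) {
  SSIM_BEST$X3 <- as.numeric(as.character(SSIM_BEST$X3))
}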
I think it's because SSIM_BEST$X3 is a factor. I'm willing to bet the values you get from the for loop are the underlying integer codes of the factor levels.
I have a couple of options that should both work.
SSIM <- c()
# convert via character first; as.numeric() on a factor returns the level codes
SSIM_BEST$X3 <- as.numeric(as.character(SSIM_BEST$X3))
for (j in seq(1, dim(SSIM_BEST)[1], by = 2)) {
  SSIM <- c(SSIM, SSIM_BEST$X3[[j]])
}
Or
SSIM <- c()
for (j in seq(1, dim(SSIM_BEST)[1], by = 2)) {
  SSIM <- c(SSIM, as.numeric(as.character(SSIM_BEST$X3[[j]])))
}
As Frank said, an explicit for loop isn't really needed here. I wrote a simple function that does what you want without one.
library(data.table)  # for shift()

getDat <- function(data, by = 2, start = 1) {
  # TRUE for every `by`-th element, counting from the first
  v <- (seq_along(data) %% by == 1)
  if (start > 1) {
    # pad and lag the logical pattern so that it begins at `start`
    v <- c(v, rep(FALSE, start - 1))
    v <- shift(v, start - 1)
    v[is.na(v)] <- FALSE
    v <- v[1:(length(v) - (start - 1))]
  }
  data <- data[v]
  data[!is.na(data)]
}
This also allows you to specify where to start in the vector.
x <- 1:50
getDat(x,2)
[1] 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49
getDat(x,2,2)
[1] 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50
getDat(x,3,10)
[1] 10 13 16 19 22 25 28 31 34 37 40 43 46 49
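For what it's worth, the same start/step subsetting can also be written with a plain seq() call (a one-line sketch, not from the original answer):
x[seq(from = 10, to = length(x), by = 3)]   # same elements as getDat(x, 3, 10)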

Add data frames row wise with [d]plyr

I have two data frames
df1
# a b
# 1 10 20
# 2 11 21
# 3 12 22
# 4 13 23
# 5 14 24
# 6 15 25
df2
# a b
# 1 4 8
I want the following output:
df3
# a b
# 1 14 28
# 2 15 29
# 3 16 30
# 4 17 31
# 5 18 32
# 6 19 33
i.e. add df2 to each row of df1.
Is there a way to get the desired output using plyr (mdplyr??) or dplyr?
I see no reason for "dplyr" for something like this. In base R you could just do:
df1 + unclass(df2)
# a b
# 1 14 28
# 2 15 29
# 3 16 30
# 4 17 31
# 5 18 32
# 6 19 33
Which is the same as df1 + list(4, 8).
One liner with dplyr.
mutate_each(df1, funs(.+ df2$.), a:b)
# a b
#1 14 28
#2 15 29
#3 16 30
#4 17 31
#5 18 32
#6 19 33
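Note that mutate_each() and funs() have since been deprecated; a rough equivalent using current dplyr's across() (a sketch, assuming dplyr 1.0 or later) would be:
mutate(df1, across(a:b, ~ .x + df2[[cur_column()]]))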
A base R solution using the sweep function:
sweep(df1, 2, unlist(df2), '+')
# a b
#1 14 28
#2 15 29
#3 16 30
#4 17 31
#5 18 32
#6 19 33

Combine two dataframes one above the other

I have two data frames and I want to stack one above the other, with the column names of the second appearing as a row of the new data frame. The column names are different and one data frame has more columns.
For example:
mydf1 <- data.frame(V1=c(1:5), V2=c(21:25))
mydf1
V1 V2
1 1 21
2 2 22
3 3 23
4 4 24
5 5 25
mydf2 <- data.frame(C1=c(1:10), C2=c(21:30),C3=c(41:50))
mydf2
C1 C2 C3
1 1 21 41
2 2 22 42
3 3 23 43
4 4 24 44
5 5 25 45
6 6 26 46
7 7 27 47
8 8 28 48
9 9 29 49
10 10 30 50
Result:
mydf
V1 V2
1 1 21 NA
2 2 22 NA
3 3 23 NA
4 4 24 NA
5 5 25 NA
6 C1 C2 C3
7 1 21 41
8 2 22 42
9 3 23 43
10 4 24 44
11 5 25 45
12 6 26 46
13 7 27 47
14 8 28 48
15 9 29 49
16 10 30 50
I don't care if all the numeric values are treated as characters.
Many thanks
You can do this easily without any packages:
mydf1 <- data.frame(V1=c(1:5), V2=c(21:25))
mydf1[,3] <- NA
names(mydf1) <- c("one", "two", "three")
mydf2 <- data.frame(C1=c(1:10), C2=c(21:30),C3=c(41:50))
names <- t(as.data.frame(names(mydf2)))
names <- as.data.frame(names)
names(mydf2) <- c("one", "two", "three")
names(names) <- c("one", "two", "three")
mydf3 <- rbind(mydf1, names)
mydf4 <- rbind(mydf3, mydf2)
> mydf4
one two three
1 1 21 <NA>
2 2 22 <NA>
3 3 23 <NA>
4 4 24 <NA>
5 5 25 <NA>
6 C1 C2 C3
7 1 21 41
8 2 22 42
9 3 23 43
10 4 24 44
11 5 25 45
12 6 26 46
13 7 27 47
14 8 28 48
15 9 29 49
16 10 30 50
Of course, you can edit the <- c("one", "two", "three") to make the final column names whatever you'd like. For example:
> mydf1 <- data.frame(V1=c(1:5), V2=c(21:25))
> mydf1[,3] <- NA
> names(mydf1) <- c("V1", "V2", "NA")
> mydf2 <- data.frame(C1=c(1:10), C2=c(21:30),C3=c(41:50))
> names <- t(as.data.frame(names(mydf2)))
> names <- as.data.frame(names)
> names(mydf2) <- c("V1", "V2", "NA")
> names(names) <- c("V1", "V2", "NA")
> mydf3 <- rbind(mydf1, names)
> mydf4 <- rbind(mydf3, mydf2)
> row.names(mydf4) <- NULL
> mydf4
V1 V2 NA
1 1 21 <NA>
2 2 22 <NA>
3 3 23 <NA>
4 4 24 <NA>
5 5 25 <NA>
6 C1 C2 C3
7 1 21 41
8 2 22 42
9 3 23 43
10 4 24 44
11 5 25 45
12 6 26 46
13 7 27 47
14 8 28 48
15 9 29 49
16 10 30 50
If you need to resort to a package for any reason when scaling this up to your real use case, then try melt from reshape2 or the plyr package. However, using a package shouldn't be necessary.
I don't know what you tried with write.table, but that seems to me like the way to go.
I would create a function something like this:
myFun <- function(...) {
  L <- list(...)
  temp <- tempfile()
  maxCol <- max(vapply(L, ncol, 1L))
  # write each data.frame (header row included) to the same temp file
  lapply(L, function(x)
    suppressWarnings(
      write.table(x, file = temp, row.names = FALSE,
                  sep = ",", append = TRUE)))
  # read everything back; fill = TRUE pads the narrower frames with NA
  read.csv(temp, header = FALSE, fill = TRUE,
           col.names = paste0("New_", sequence(maxCol)),
           stringsAsFactors = FALSE)
}
Usage would then simply be:
myFun(mydf1, mydf2)
# New_1 New_2 New_3
# 1 V1 V2
# 2 1 21
# 3 2 22
# 4 3 23
# 5 4 24
# 6 5 25
# 7 C1 C2 C3
# 8 1 21 41
# 9 2 22 42
# 10 3 23 43
# 11 4 24 44
# 12 5 25 45
# 13 6 26 46
# 14 7 27 47
# 15 8 28 48
# 16 9 29 49
# 17 10 30 50
The function is written such that you can specify more than two data.frames as input:
mydf3 <- data.frame(matrix(1:8, ncol = 4))
myFun(mydf1, mydf2, mydf3)
# New_1 New_2 New_3 New_4
# 1 V1 V2
# 2 1 21
# 3 2 22
# 4 3 23
# 5 4 24
# 6 5 25
# 7 C1 C2 C3
# 8 1 21 41
# 9 2 22 42
# 10 3 23 43
# 11 4 24 44
# 12 5 25 45
# 13 6 26 46
# 14 7 27 47
# 15 8 28 48
# 16 9 29 49
# 17 10 30 50
# 18 X1 X2 X3 X4
# 19 1 3 5 7
# 20 2 4 6 8
Here's one approach with the rbind.fill function (part of the plyr package).
library(plyr)
setNames(rbind.fill(setNames(mydf1, names(mydf2[seq(mydf1)])),
                    rbind(names(mydf2), mydf2)),
         names(mydf1))
V1 V2 NA
1 1 21 <NA>
2 2 22 <NA>
3 3 23 <NA>
4 4 24 <NA>
5 5 25 <NA>
6 C1 C2 C3
7 1 21 41
8 2 22 42
9 3 23 43
10 4 24 44
11 5 25 45
12 6 26 46
13 7 27 47
14 8 28 48
15 9 29 49
16 10 30 50
Give this a try.
Assign the column names from the second data set to a vector, and then replace the second set's names with the names from the first set. Then create a list where the middle element is the vector you assigned. Now when you call rbind, it should be fine since everything is in the right order (here d1 and d2 stand for mydf1 and mydf2 from the question).
d1$V3 <- NA
nm <- names(d2)
names(d2) <- names(d1)
dc <- do.call(rbind, list(d1,nm,d2))
rownames(dc) <- NULL
dc
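Another base R option (a sketch, not from the answers above) goes through a character matrix: pad mydf1 to three columns, then stack mydf2's names as a row between the two pieces.
m <- rbind(as.matrix(cbind(mydf1, NA)),
           names(mydf2),
           as.matrix(mydf2))
rownames(m) <- NULL   # drop duplicated row names before converting back
mydf <- as.data.frame(m, stringsAsFactors = FALSE)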

Subsetting top 4 observations of each unique ID

I have a data frame of 4 columns and a few thousand rows. I am ordering the data frame by the 4th column, which is the ID (descending), and then by the second column (ascending). Here's what my data looks like:
X1 X2 X3 X4
24 1 23 25
21 3 19 25
19 6 20 25
11 12 14 25
14 9 21 24
3 12 25 24
24 15 23 24
8 1 4 23
17 4 12 23
16 11 23 23
20 19 21 23
24 19 16 23
19 20 7 23
19 22 22 22
11 2 18 21
15 9 19 21
10 14 9 21
17 15 19 21
16 20 6 21
I am trying to keep the highest 4 values of each ID (if available), my desired output would be
X1 X2 X3 X4
24 1 23 25
21 3 19 25
19 6 20 25
11 12 14 25
14 9 21 24
3 12 25 24
24 15 23 24
8 1 4 23
17 4 12 23
16 11 23 23
20 19 21 23
19 22 22 22
11 2 18 21
15 9 19 21
10 14 9 21
17 15 19 21
# note that 2 of the 23 ID observations and one of the 21 ID observations were removed.
I was wondering if there is some short command that can do the job for me. I can think of a command that is around a page long, which subsets the data according to the 4th column, takes the top rows of each subset, and rbinds them again, but that sounds so unprofessional!
Here's a command to generate a similar example:
m0 <- matrix(0, 100, 4)
df <- data.frame(apply(m0, c(1,2), function(x) sample(c(0:25),1)))
##fix(df)
odf <- df[order(-as.numeric(df$X4), as.numeric(df$X2)), ]
Thanks all.
maybe data.table:
require(data.table)
df<-read.table(header=T,text=" X1 X2 X3 X4
24 1 23 25
21 3 19 25
19 6 20 25
11 12 14 25
14 9 21 24
3 12 25 24
24 15 23 24
8 1 4 23
17 4 12 23
16 11 23 23
20 19 21 23
24 19 16 23
19 20 7 23
19 22 22 22
11 2 18 21
15 9 19 21
10 14 9 21
17 15 19 21
16 20 6 21")
data.table(df)[,.SD[order(X2)][1:4,],by="X4"][!is.na(X3)][,list(X1,X2,X3,X4)]
X1 X2 X3 X4
1: 24 1 23 25
2: 21 3 19 25
3: 19 6 20 25
4: 11 12 14 25
5: 14 9 21 24
6: 3 12 25 24
7: 24 15 23 24
8: 8 1 4 23
9: 17 4 12 23
10: 16 11 23 23
11: 20 19 21 23
12: 19 22 22 22
13: 11 2 18 21
14: 15 9 19 21
15: 10 14 9 21
16: 17 15 19 21
here's what's happening in the data.table call:
data.table(df)[ # data.table of df
,.SD[ # for each by=X4, .SD is the sub-table
order(X2)][1:4,], # first four entries ordered by X2
by="X4"][ # X4 is the grouping variable
!is.na(X3)][ # drop the NA rows padded in for groups with fewer than 4 entries
,list(X1,X2,X3,X4)] # order the columns
I think that Thomas's solution is fine, but can be improved. I would guess that the splitting, recombining, and reordering might be time consuming.
Instead, I would create a vector from which we can subset.
This is easily done with ave and should work since the data are already ordered.
Continuing from:
odf <- df[order(-as.numeric(df$X4), as.numeric(df$X2)), ]
we can do:
out <- odf[ave(odf$X4, odf$X4, FUN = seq_along) <= 4, ]
head(out)
# X1 X2 X3 X4
# 24 3 4 13 25
# 6 23 5 13 25
# 19 9 11 24 25
# 40 10 13 11 25
# 93 16 2 25 24
# 26 10 11 13 24
tail(out)
# X1 X2 X3 X4
# 61 23 7 13 2
# 2 9 9 5 2
# 17 18 18 16 2
# 67 12 1 1 1
# 52 22 14 24 1
# 9 16 24 6 1
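As a quick illustration of what that ave() call produces (a toy example, not the question's data), FUN = seq_along gives a running count within each X4 group, which is then compared against 4:
ave(c(25, 25, 25, 23, 23), c(25, 25, 25, 23, 23), FUN = seq_along)
# [1] 1 2 3 1 2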
Update: New alternatives and benchmarks
The "dplyr" package would be great for this, and the syntax is pretty compact. But first, let's set some things up to see how fast these options are:
Functions to benchmark
fun1 <- function() {
  odf <- df[order(-as.numeric(df$X4), as.numeric(df$X2)), ]
  out <- do.call(rbind, lapply(split(odf, odf$X4),
                               function(z) head(z[order(z$X2), ], 4)))
  out[order(out$X4, decreasing = TRUE), ]
}
fun2 <- function() {
  odf <- df[order(-as.numeric(df$X4), as.numeric(df$X2)), ]
  odf[ave(odf$X4, odf$X4, FUN = seq_along) <= 4, ]
}
fun3 <- function() {
  DT <- data.table(df)
  DT[, X := -X4]
  setkey(DT, X, X2)
  DT[, .SD[sequence(min(.N, 4))], by = X][, X := NULL][]
}
fun4 <- function() {
  group_by(arrange(df, desc(X4), X2), X4) %.%
    mutate(vals = seq_along(X4)) %.%
    filter(vals <= 4)
}
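For reference (not included in the benchmark below), current dplyr (1.0 or later) can express the fun4() idea more directly with slice_head(); a sketch:
fun5 <- function() {
  df |>
    arrange(desc(X4), X2) |>
    group_by(X4) |>
    slice_head(n = 4) |>
    ungroup()
}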
A bigger version of your sample data
set.seed(1)
df <- data.frame(matrix(sample(0:1000, 1000000 * 4, replace = TRUE), ncol = 4))
The necessary packages
library(data.table)
library(dplyr)
library(microbenchmark)
The first two approaches (Thomas's and my first approach) take a fair amount of time, so instead of benchmarking, I'll just time them once.
system.time(fun1())
# user system elapsed
# 6.645 0.007 6.670
system.time(fun2())
# user system elapsed
# 4.053 0.004 4.186
Here's the "dplyr" and "data.table" results.
microbenchmark(fun3(), fun4(), times = 20)
# Unit: seconds
# expr min lq median uq max neval
# fun3() 2.157956 2.221746 2.303286 2.343951 2.392391 20
# fun4() 1.169212 1.180780 1.194994 1.206651 1.369922 20
Compare the output of the "dplyr" and "data.table" approaches:
out_DT <- fun3()
out_DP <- fun4()
out_DT
# X1 X2 X3 X4
# 1: 340 0 708 1000
# 2: 144 1 667 1000
# 3: 73 2 142 1000
# 4: 79 2 826 1000
# 5: 169 0 870 999
# ---
# 4000: 46 4 2 1
# 4001: 88 0 809 0
# 4002: 535 0 522 0
# 4003: 75 3 234 0
# 4004: 983 3 492 0
head(out_DP, 5)
# Source: local data frame [5 x 5]
# Groups: X4
#
# X1 X2 X3 X4 vals
# 1 340 0 708 1000 1
# 2 144 1 667 1000 2
# 3 73 2 142 1000 3
# 4 79 2 826 1000 4
# 5 169 0 870 999 1
tail(out_DP, 5)
# Source: local data frame [5 x 5]
# Groups: X4
#
# X1 X2 X3 X4 vals
# 4000 46 4 2 1 4
# 4001 88 0 809 0 1
# 4002 535 0 522 0 2
# 4003 75 3 234 0 3
# 4004 983 3 492 0 4
I include your code again with a set.seed call, so that this is exactly reproducible.
set.seed(1)
m0 <- matrix(0, 100, 4)
df <- data.frame(apply(m0, c(1,2), function(x) sample(c(0:25),1)))
odf <- df[order(-as.numeric(df$X4), as.numeric(df$X2)), ]
Here's the code you need using a split-apply-combine strategy:
out <- do.call(rbind, lapply(split(odf, odf$X4), function(z) head(z[order(z$X2),],4) ))
out <- out[order(out$X4, decreasing=TRUE),]
Result:
> dim(out)
[1] 79 4
> head(out)
X1 X2 X3 X4
25.24 3 4 13 25
25.6 23 5 13 25
25.19 9 11 24 25
25.40 10 13 11 25
24.93 16 2 25 24
24.26 10 11 13 24
