transforming dataset (similarity ratings) - r

I want to transform the following data format (simplified representation):
image1 image2 rating
1 1 2 6
2 1 3 5
3 1 4 7
4 2 3 3
5 2 4 5
6 3 4 1
Reproduced by:
structure(list(image1 = c(1, 1, 1, 2, 2, 3), image2 = c(2, 3,
4, 3, 4, 4), rating = c(6, 5, 7, 3, 5, 1)), .Names = c("image1",
"image2", "rating"), row.names = c(NA, -6L), class = "data.frame")
To a format resembling a correlation matrix, where the first two columns serve as row and column indices and the ratings are the values:
1 2 3 4
1 NA 6 5 7
2 6 NA 3 5
3 5 3 NA 1
4 7 5 1 NA
Does anyone know of a function in R to do this?

I would rather use matrix indexing:
N <- max(dat[c("image1", "image2")])
out <- matrix(NA, N, N)
out[cbind(dat$image1, dat$image2)] <- dat$rating
out[cbind(dat$image2, dat$image1)] <- dat$rating
# [,1] [,2] [,3] [,4]
# [1,] NA 6 5 7
# [2,] 6 NA 3 5
# [3,] 5 3 NA 1
# [4,] 7 5 1 NA
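The matrix-indexing answer can be run end to end as in the sketch below. It assumes dat is the data frame from the question; the name idx and the final symmetry check are additions for illustration and are not part of the original answer.

```r
# Data frame from the question
dat <- data.frame(image1 = c(1, 1, 1, 2, 2, 3),
                  image2 = c(2, 3, 4, 3, 4, 4),
                  rating = c(6, 5, 7, 3, 5, 1))

N <- max(dat[c("image1", "image2")])
out <- matrix(NA_real_, N, N)

# A two-column matrix used as an index selects one (row, col) cell per row
idx <- cbind(dat$image1, dat$image2)
out[idx] <- dat$rating          # cells above the diagonal
out[idx[, 2:1]] <- dat$rating   # mirrored cells below the diagonal

isSymmetric(out)
```

Because each rating is written to an explicit (row, column) pair, this does not depend on the order of the rows in dat.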

I don't like the <<- operator very much, but it works for this (naming your structure s):
N <- max(s[,1:2])
m <- matrix(NA, nrow=N, ncol=N)
apply(s, 1, function(x) { m[x[1], x[2]] <<- m[x[2], x[1]] <<- x[3]})
> m
[,1] [,2] [,3] [,4]
[1,] NA 6 5 7
[2,] 6 NA 3 5
[3,] 5 3 NA 1
[4,] 7 5 1 NA
Not as elegant as Karsten's solution, but it does not rely on the order of the rows, nor does it require that all combinations be present.

Here is one approach, where dat is the data frame as defined in the question
res <- matrix(0, nrow=4, ncol=4) # dim may need to be adjusted
ll <- lower.tri(res, diag=FALSE)
res[which(ll)] <- dat$rating
res <- res + t(res)
diag(res) <- NA
This works only if the rows are ordered as in the question and every pair of images is present.
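If the rows arrive in a different order, sorting them first restores the order that lower.tri() expects. The following is a minimal sketch under two assumptions: every pair is present exactly once, and image1 < image2 in each row. The shuffled data frame and the name dat_sorted are made up for illustration.

```r
# Same ratings as in the question, with the rows shuffled
dat <- data.frame(image1 = c(3, 1, 2, 1, 2, 1),
                  image2 = c(4, 3, 4, 2, 3, 4),
                  rating = c(1, 5, 5, 6, 3, 7))

# lower.tri() cells are filled column by column, so sort by image1, then image2
dat_sorted <- dat[order(dat$image1, dat$image2), ]

n <- max(dat_sorted$image1, dat_sorted$image2)
res <- matrix(0, nrow = n, ncol = n)
res[lower.tri(res)] <- dat_sorted$rating
res <- res + t(res)   # mirror the lower triangle into the upper one
diag(res) <- NA
res
```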

Related

Efficiently replicate matrix rows by group in R

I am trying to find a way to efficiently replicate rows of a matrix in R based on a group. Let's say I have the following matrix a:
a <- matrix(
c(1, 2, 3,
4, 5, 6,
7, 8, 9),
ncol = 3, byrow = TRUE
)
I want to create a new matrix where each row in a is repeated based on a number specified in a vector (what I'm calling a "group"), e.g.:
reps <- c(2, 3, 4)
In this case, the resulting matrix would be:
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 1 2 3
[3,] 4 5 6
[4,] 4 5 6
[5,] 4 5 6
[6,] 7 8 9
[7,] 7 8 9
[8,] 7 8 9
[9,] 7 8 9
This is the only solution I've come up with so far:
matrix(
rep(a, times = rep(reps, times = 3)),
ncol = 3, byrow = FALSE
)
Notice that in this solution I have to use rep() twice - first to replicate the reps vector, and then again to actually replicate each row of a.
This solution works, but I'm looking for something more efficient: the replication runs in every iteration of an optimization loop, and it is rather slow when a is large.
I'll note that this question is very similar, but it is about repeating each row the same number of times. This question is also similarly about efficiency, but it's about replicating entire matrices.
UPDATE
Since I'm interested in efficiency, here is a simple comparison of the solutions provided thus far...I'll update this as more come in, but in general it looks like the seq_along solution by F. Privé is the fastest.
library(dplyr)
library(tidyr)
a <- matrix(seq(9), ncol = 3, byrow = TRUE)
reps <- c(2, 3, 4)
rbenchmark::benchmark(
"original solution" = {
result <- matrix(rep(a, times = rep(reps, times = 3)),
ncol = 3, byrow = FALSE)
},
"seq_along" = {
result <- a[rep(seq_along(reps), reps), ]
},
"uncount" = {
result <- as.data.frame(a) %>%
uncount(reps)
},
replications = 1000,
columns = c("test", "replications", "elapsed", "relative")
)
test replications elapsed relative
1 original solution 1000 0.004 1.333
2 seq_along 1000 0.003 1.000
3 uncount 1000 1.722 574.000
Simply use a[rep(seq_along(reps), reps), ].
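To make the one-liner concrete, the sketch below builds the row index separately before subsetting. The names row_idx and result are only illustrative; drop = FALSE is added here so the result stays a matrix even if only one row is selected.

```r
a <- matrix(1:9, ncol = 3, byrow = TRUE)
reps <- c(2, 3, 4)

# Row i of `a` appears reps[i] times in the index vector
row_idx <- rep(seq_along(reps), times = reps)   # 1 1 2 2 2 3 3 3 3

# Subsetting with repeated row numbers copies those rows
result <- a[row_idx, , drop = FALSE]
dim(result)   # 9 3
```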
Another option with uncount
library(dplyr)
library(tidyr)
as.data.frame(a) %>%
uncount(reps)
-output
V1 V2 V3
1 1 2 3
2 1 2 3
3 4 5 6
4 4 5 6
5 4 5 6
6 7 8 9
7 7 8 9
8 7 8 9
9 7 8 9
Another base R option (not as elegant as the answers by @F. Privé or @akrun):
> t(do.call(cbind, mapply(replicate, reps, asplit(a, 1))))
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 1 2 3
[3,] 4 5 6
[4,] 4 5 6
[5,] 4 5 6
[6,] 7 8 9
[7,] 7 8 9
[8,] 7 8 9
[9,] 7 8 9

I have a table consisting of many samples across many columns; how do I subtract each column's mean from every sample in that column? [duplicate]

I have a matrix with 5 columns and 4 rows. I also have a vector of length 3. I want to subtract the values in the vector from columns 3, 4 and 5 respectively, in every row of the matrix.
b <- matrix(rep(1:20), nrow=4, ncol=5)
[,1] [,2] [,3] [,4] [,5]
[1,] 1 5 9 13 17
[2,] 2 6 10 14 18
[3,] 3 7 11 15 19
[4,] 4 8 12 16 20
c <- c(5,6,7)
to get
[,1] [,2] [,3] [,4] [,5]
[1,] 1 5 4 7 10
[2,] 2 6 5 8 11
[3,] 3 7 6 9 12
[4,] 4 8 7 10 13
This is exactly what sweep was made for:
b <- matrix(rep(1:20), nrow=4, ncol=5)
x <- c(5,6,7)
b[,3:5] <- sweep(b[,3:5], 2, x)
b
# [,1] [,2] [,3] [,4] [,5]
#[1,] 1 5 4 7 10
#[2,] 2 6 5 8 11
#[3,] 3 7 6 9 12
#[4,] 4 8 7 10 13
..or even without subsetting or reassignment:
sweep(b, 2, c(0,0,x))
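The two sweep calls can be compared side by side. This sketch starts each variant from a fresh copy of b (the names b1 and b2 are illustrative) and spells out the MARGIN, STATS and FUN arguments, where "-" is sweep's default function.

```r
b <- matrix(1:20, nrow = 4, ncol = 5)
x <- c(5, 6, 7)

# Variant 1: sweep only the affected columns and assign them back
b1 <- b
b1[, 3:5] <- sweep(b[, 3:5], MARGIN = 2, STATS = x, FUN = "-")

# Variant 2: pad x with zeros so columns 1 and 2 are left unchanged
b2 <- sweep(b, MARGIN = 2, STATS = c(0, 0, x), FUN = "-")

all.equal(b1, b2)
```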
Perhaps not that elegant, but
b <- matrix(rep(1:20), nrow=4, ncol=5)
x <- c(5,6,7)
b[,3:5] <- t(t(b[,3:5])-x)
should do the trick. We subset the matrix to change only the part we need, and we use t() (transpose) so that R's column-wise vector recycling lines each element of x up with the matching column of the original matrix; the second t() restores the original orientation.
If you want to avoid the transposes, you could do something like
b[,3:5] <- b[,3:5]-x[col(b[,3:5])]
as well. Here we subset twice: col() applied to the second subset gives the column number of every cell, which picks the matching element of x, because both subsets are indexed in the same order.
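A small sketch of why the col() version and the double-transpose version agree. The intermediate names sub, x_expanded, via_col and via_transpose are introduced here only for illustration.

```r
b <- matrix(1:20, nrow = 4, ncol = 5)
x <- c(5, 6, 7)
sub <- b[, 3:5]

# col(sub) holds the column number of every cell, so x[col(sub)] repeats
# each element of x down its own column, in the same order as sub's cells
x_expanded <- x[col(sub)]

via_col <- sub - x_expanded
via_transpose <- t(t(sub) - x)   # recycling runs down the columns of t(sub)

all.equal(via_col, via_transpose)
```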
I think my favorite from the question that @thelatemail linked was
b[,3:5] <- sweep(b[,3:5], 2, x, `-`)
Another way, with apply:
b[,3:5] <- t(apply(b[,3:5], 1, function(x) x-c))
A simple solution:
b <- matrix(rep(1:20), nrow=4, ncol=5)
c <- c(5,6,7)
for (i in seq_len(nrow(b))) {
b[i,3:5] <- b[i,3:5] - c
}
This can be done with the rray package in a very satisfying way, using its NumPy-like broadcasting subtraction operator %b-%:
#install.packages("rray")
library(rray)
b <- matrix(rep(1:20), nrow=4, ncol=5)
x <- c(5, 6, 7)
b[, 3:5] <- b[, 3:5] %b-% matrix(x, 1)
b
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 1 5 4 7 10
#> [2,] 2 6 5 8 11
#> [3,] 3 7 6 9 12
#> [4,] 4 8 7 10 13
For large matrices this is even faster than sweep:
#install.packages("bench")
res <- bench::press(
size = c(10, 1000, 10000),
frac_selected = c(0.1, 0.5, 1),
{
B <- matrix(sample(size*size), nrow=size, ncol=size)
B2 <- B
x <- sample(size, size=ceiling(size*frac_selected))
idx <- sample(size, size=ceiling(size*frac_selected))
bench::mark(rray = {B2[, idx] <- B[, idx, drop = FALSE] %b-% matrix(x, nrow = 1); B2},
sweep = {B2[, idx] <- sweep(B[, idx, drop = FALSE], MARGIN = 2, x); B2}
)
}
)
plot(res)

Remove all the rows containing any negative value or NA in data matrix

The data set contains 19972 rows and 3006 columns. How can I remove all the rows that contain any negative value or NA?
Generate a data.frame containing NA and negative values:
set.seed(1)
df <- data.frame(a=runif(10, -10,10), b=runif(10, -5, 5))
df$a[7] <- NA
df
df looks like this:
a b
1 -4.689827 -2.94025425
2 -2.557522 -3.23443247
3 1.457067 1.87022847
4 8.164156 -1.15896282
5 -5.966361 2.69841420
6 7.967794 -0.02300758
7 NA 2.17618508
8 3.215956 4.91906095
9 2.582281 -1.19964821
10 -8.764275 2.77445221
Then:
negative_row <- apply(df, 1, function(x) any(x < 0 | is.na(x)))
df[!negative_row,]
giving:
a b
3 1.457067 1.870228
8 3.215956 4.919061
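For a data set as wide as the one in the question, a vectorized variant built on rowSums() may be worth trying instead of a row-wise apply(). This is only a sketch and assumes every column is numeric; the names m, bad_cell, keep and clean are illustrative.

```r
set.seed(1)
df <- data.frame(a = runif(10, -10, 10), b = runif(10, -5, 5))
df$a[7] <- NA

m <- as.matrix(df)   # assumes every column is numeric

# TRUE for cells that are NA or negative; TRUE | NA is TRUE, so no NA remains
bad_cell <- is.na(m) | m < 0

# Keep the rows that have no bad cell at all
keep <- rowSums(bad_cell) == 0
clean <- df[keep, ]
clean
```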
You can try the code below
> subset(m, rowSums(sign(m) < 0) == 0)
[,1] [,2] [,3] [,4]
[1,] 1 1 2 4
[2,] 4 16 8 16
[3,] 5 25 0 20
where m is the following matrix:
> m
[,1] [,2] [,3] [,4]
[1,] 1 1 2 4
[2,] 2 NA 4 8
[3,] -3 0 6 12
[4,] 4 16 8 16
[5,] 5 25 0 20
Dummy Data
> dput(m)
structure(c(1, 2, -3, 4, 5, 1, NA, 0, 16, 25, 2, 4, 6, 8, 0,
4, 8, 12, 16, 20), .Dim = 5:4, .Dimnames = list(NULL, NULL))

All possible combinations of two vectors while keeping the order in R

I have a vector, say vec1, and another vector named vec2 as follows:
vec1 = c(4,1)
# [1] 4 1
vec2 = c(5,3,2)
# [1] 5 3 2
What I'm looking for is all possible combinations of vec1 and vec2 while the order of the vectors' elements is kept. That is, the resultant matrix should be like this:
> res
[,1] [,2] [,3] [,4] [,5]
[1,] 4 1 5 3 2
[2,] 4 5 1 3 2
[3,] 4 5 3 1 2
[4,] 4 5 3 2 1
[5,] 5 4 1 3 2
[6,] 5 4 3 1 2
[7,] 5 4 3 2 1
[8,] 5 3 4 1 2
[9,] 5 3 4 2 1
[10,] 5 3 2 4 1
# res=structure(c(4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 1, 5, 5, 5, 4, 4, 4,
# 3, 3, 3, 5, 1, 3, 3, 1, 3, 3, 4, 4, 2, 3, 3, 1, 2, 3, 1, 2, 1,
# 2, 4, 2, 2, 2, 1, 2, 2, 1, 2, 1, 1), .Dim = c(10L, 5L))
There is no repetition allowed for two vectors. That is, all rows of the resultant matrix have unique elements.
I'm actually looking for the most efficient way. One way to tackle this problem is to generate all possible permutations of length n which grows factorially (n=5 here) and then apply filtering. But it's time-consuming as n grows.
Is there an efficient way to do that?
Try this one:
nv1 <- length(vec1)
nv2 <- length(vec2)
n <- nv1 + nv2
result <- combn(n,nv1,function(v) {z=integer(n);z[v]=vec1;z[-v]=vec2;z})
The idea is to produce all combinations of indices at which to put the elements of vec1.
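Note that combn() returns one combination per column, so a transpose is needed to get the row layout shown in the question. The sketch below does that; the helper name place is hypothetical and only wraps the anonymous function from the answer.

```r
vec1 <- c(4, 1)
vec2 <- c(5, 3, 2)
nv1 <- length(vec1)
n <- nv1 + length(vec2)

# Each combination v lists the positions taken by vec1; vec2 fills the rest,
# so the relative order inside each vector is preserved
place <- function(v) {
  z <- numeric(n)
  z[v] <- vec1
  z[-v] <- vec2
  z
}

result <- t(combn(n, nv1, FUN = place))   # one interleaving per row
dim(result)   # 10 5
```

This generates only choose(n, nv1) rows directly, rather than filtering all n! permutations.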
Not as elegant as Marat Talipov's solution, but you can do:
# get the ordering per vector
cc <- c(order(vec1, decreasing = TRUE), order(vec2, decreasing = TRUE) + length(vec1))
cc
[1] 1 2 3 4 5
# permutation to get all "order-combinations"
library(combinat)
m <- do.call(rbind, permn(cc))
# keep a permutation only if the indices belonging to each vector are still in order:
gr <- apply(m, 1, function(x){
!is.unsorted(x[x < (length(vec1)+1)]) & !is.unsorted(x[x > (length(vec1))])
})
# result, exchange the order index with the vector elements:
t(apply(m[gr, ], 1, function(x, y) y[x], c(vec1, vec2)))
[,1] [,2] [,3] [,4] [,5]
[1,] 4 1 5 3 2
[2,] 4 5 3 1 2
[3,] 4 5 3 2 1
[4,] 4 5 1 3 2
[5,] 5 4 1 3 2
[6,] 5 4 3 2 1
[7,] 5 4 3 1 2
[8,] 5 3 4 1 2
[9,] 5 3 4 2 1
[10,] 5 3 2 4 1
