How to easily create dissimilarity matrix from vector of differences? - r

In my research each subject was given n*(n-1)/2 questions about his subjective opinion about dissimilarity between n=5 objects (for later use with 3-way multidimensional scaling).
I want to create a dissimilarity matrix from the 10-item vector v, arranged e.g. in the following fashion (for n=5):
1
2 5
3 6 8
4 7 9 10
This is sample code that achieves it for this particular n:
dissim<-rep(0,n*n)
dim(dissim)<-c(5,5)
dissim[2,1]<-v[1]
dissim[3,1]<-v[2]
dissim[4,1]<-v[3]
dissim[5,1]<-v[4]
dissim[3,2]<-v[5]
dissim[4,2]<-v[6]
dissim[5,2]<-v[7]
dissim[4,3]<-v[8]
dissim[5,3]<-v[9]
dissim[5,4]<-v[10]
Is there any utility function which helps doing it for any n? I know I can use two nested loops to do it, but the code would be more clear if I used a dedicated function.
And maybe I would learn about the existence of another useful library in the process?

n <- 5
mat <- matrix(0, ncol = n, nrow = n)
mat[lower.tri(mat)] <- 1:10
mat
     [,1] [,2] [,3] [,4] [,5]
[1,]    0    0    0    0    0
[2,]    1    0    0    0    0
[3,]    2    5    0    0    0
[4,]    3    6    8    0    0
[5,]    4    7    9   10    0
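For an arbitrary n, the same lower.tri() fill can be wrapped in a small helper. This is only a sketch: vec_to_dissim is a made-up name (not an existing function), and it assumes v is ordered column by column as in the question.

```r
# Hypothetical helper, not part of base R or any package
vec_to_dissim <- function(v) {
  # recover n from length(v) == n * (n - 1) / 2
  n <- (1 + sqrt(1 + 8 * length(v))) / 2
  stopifnot(n == round(n))
  mat <- matrix(0, nrow = n, ncol = n)
  mat[lower.tri(mat)] <- v   # lower triangle is filled column by column
  as.dist(mat)               # as.dist() reads the lower triangle
}

d <- vec_to_dissim(1:10)
as.matrix(d)                 # full symmetric 5 x 5 matrix
```

as.matrix() on the result gives the symmetric matrix if a plain matrix is needed rather than a dist object.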

Er... By chance I found the solution myself. It so happens, that the internal structure of the dist object is just the vector v. So what works is this:
dissim<-v
class(dissim)='dist'
attr(dissim,"Size")<-5
dissim<-as.dist(dissim)
It works now, but I am not sure if this is a documented way and will always be valid.
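For what it's worth, the Value section of ?dist describes this layout: the lower triangle stored by columns in a vector of length n*(n-1)/2, plus a "Size" attribute. A minimal sketch that checks the index formula given there (idx is a name made up for this example):

```r
v <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
n <- 5

# Build the dist object directly from the vector
dissim <- structure(v, Size = n, class = "dist")
m <- as.matrix(dissim)

# Position of the (i, j) dissimilarity in the vector, for i < j <= n
idx <- function(i, j, n) n * (i - 1) - i * (i - 1) / 2 + j - i

# Entry between objects 2 and 4 should be the 6th element of v
stopifnot(m[4, 2] == v[idx(2, 4, n)])
```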

Related

Cluster groups of 1s in a binary matrix

I'm looking to create clusters around all 1s and 0s. Similar to Minesweeper, I want to basically "draw a circle" around all 1s, and create a border where 0s exist.
I have tried using hclust() and creating a distance matrix, but the actual table I am working with is very large, and I have run into problems with run time.
test_matrix <- matrix(c( 1,1,0,0,0,0,1,
1,1,1,0,0,1,0,
0,1,0,0,0,1,0,
0,0,0,1,1,1,0,
0,0,0,1,1,1,1),nrow=5)
Result looks like this:
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]    1    0    0    1    0    1    0
[2,]    1    1    0    0    0    1    1
[3,]    0    1    1    0    0    0    1
[4,]    0    1    0    0    0    0    1
[5,]    0    1    0    1    1    0    1
My rules are as follows: If any 1 is connected to any 1 via UP, DOWN, LEFT, RIGHT, DIAGONAL(any direction), continue growing the "cluster". Based on these rules (8 points of connection for every point), I can spot four unique clusters with isolated 1s.
How would you code to find these groups?
I think clustering is the right approach here, but you chose a poor (computationally expensive) method for the task. I would go for DBSCAN like this:
library(dbscan)
## slightly altered test matrix to include a "cluster" with a single 1
test_matrix <- matrix(c( 1,1,0,0,0,0,1,
1,1,1,0,0,1,0,
0,1,0,0,0,1,0,
0,0,0,1,1,1,0,
1,0,0,1,1,1,1),
nrow=5, byrow = TRUE)
## find rows and columns of 1s
ones_pos <- which(test_matrix > 0,arr.ind=TRUE)
## perform DBSCAN clustering
## setting eps = sqrt(2) + .1 corresponds to your neighbourhood definition
## setting minPts = 2 will mark clusters of one point as noise
clust <- dbscan(ones_pos, eps = sqrt(2) + .1, minPts = 2)
## find the indices of noise elements
singular_ones <- ones_pos[clust$cluster == 0, ]
singular_ones
#> row col
#> 5 1
To find all clusters (including those that just consist of one 1) just set minPts to 1. In this case there can be no noise. The cluster membership is stored in clust$cluster.
I am quite certain this approach will also be quite fast with large matrices.
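If installing a package is not an option, the same 8-neighbour rule can also be sketched in base R as a flood fill. Note this is a different technique from DBSCAN, label_clusters is a made-up name, and the queue handling is not tuned for very large matrices.

```r
test_matrix <- matrix(c(1,1,0,0,0,0,1,
                        1,1,1,0,0,1,0,
                        0,1,0,0,0,1,0,
                        0,0,0,1,1,1,0,
                        1,0,0,1,1,1,1),
                      nrow = 5, byrow = TRUE)

# Hypothetical helper: gives each 8-connected group of 1s its own integer id
label_clusters <- function(m) {
  nr <- nrow(m); nc <- ncol(m)
  labels <- matrix(0L, nr, nc)
  current <- 0L
  for (start in which(m > 0)) {
    if (labels[start] != 0L) next        # already part of a cluster
    current <- current + 1L
    labels[start] <- current
    queue <- start
    while (length(queue) > 0) {
      cell <- queue[1]
      queue <- queue[-1]
      r <- (cell - 1) %% nr + 1          # row of the linear index
      k <- (cell - 1) %/% nr + 1         # column of the linear index
      for (dr in -1:1) {
        for (dc in -1:1) {
          rr <- r + dr
          cc <- k + dc
          if (rr < 1 || rr > nr || cc < 1 || cc > nc) next
          if (m[rr, cc] > 0 && labels[rr, cc] == 0L) {
            labels[rr, cc] <- current
            queue <- c(queue, (cc - 1) * nr + rr)
          }
        }
      }
    }
  }
  labels
}

labs <- label_clusters(test_matrix)
labs
```

Cells with 0 keep the label 0; a cluster id that occurs only once marks an isolated 1.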

Fastest way to populate a matrix using row/column indices stored in vectors

I'm trying to do something that seems relatively straightforward to do with something apply-esque, but I can only get it to work using a for loop.
The general idea is I have two vectors, with one vector corresponding to a row in the matrix and another vector corresponding to the column, both the same length. I start with a 0 matrix, and increment [row,column] based on the pair of values in the two vectors. For example:
vectorCols <- c(1,2,3,1,3)
vectorRows <- c(2,1,2,3,2)
countMat <- matrix(rep(0,9),ncol=3)
And at the end, countMat is:
     [,1] [,2] [,3]
[1,]    0    1    0
[2,]    1    0    2
[3,]    1    0    0
This is pretty manageable with a for loop:
for (i in 1:length(vectorCols)){
countMat[vectorRows[i],vectorCols[i]] <- countMat[vectorRows[i],vectorCols[i]] + 1
}
But I can't help thinking there is a better way to do this in R. I've tried using the apply family of functions, but these don't cooperate well when you want to assign something. I know I could use mapply and build each element of countMat one value at a time, but this seems inefficient: vectorRows and vectorCols are very long, and it seems wasteful to traverse them in full once for each cell in countMat. But other than a loop and mapply, I can't think of how to do this. I've considered using assign with one of the apply family, but there's a caveat: my matrix actually has names for the columns and rows, with the names stored in vectorCols and vectorRows, and it seems assign doesn't play well with something like countMat["rowName"]["columnName"] (not to mention that apply will still want to return a value for each step in the iteration).
Any suggestions? I'd also be curious if there is an ideal way to do this if I don't have names for the vector columns and rows. If that's the case then maybe I can convert vectorCols and vectorRows to numbers, then build the matrix, then rename everything.
Thanks all.
Here are some solutions. No packages are needed.
1) table
table(vectorRows, vectorCols)
giving:
          vectorCols
vectorRows 1 2 3
         1 0 1 0
         2 1 0 2
         3 1 0 0
Note that if there is any row or column with no entries then it will not appear.
2) aggregate
ag <- aggregate( Freq ~ ., data.frame(Freq = 1, vectorRows, vectorCols), sum)
countMat[as.matrix(ag[-3])] <- ag[[3]]
giving:
> countMat
     [,1] [,2] [,3]
[1,]    0    1    0
[2,]    1    0    2
[3,]    1    0    0
3) xtabs
xtabs(~ vectorRows + vectorCols)
giving:
          vectorCols
vectorRows 1 2 3
         1 0 1 0
         2 1 0 2
         3 1 0 0
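Regarding named rows/columns and the caveat that empty rows or columns disappear from table(): one way around both is to pass factors with explicit levels. A small sketch with made-up names ("r1" to "r4", "c1" to "c3" are illustrative, not from the question):

```r
vectorRows <- c("r2", "r1", "r2", "r3", "r2")
vectorCols <- c("c1", "c2", "c3", "c1", "c3")

rowLevels <- c("r1", "r2", "r3", "r4")   # "r4" never occurs in the data
colLevels <- c("c1", "c2", "c3")

# Explicit levels keep rows/columns that have no entries
tab <- table(factor(vectorRows, levels = rowLevels),
             factor(vectorCols, levels = colLevels))

# Convert the table to a plain named matrix
countMat <- matrix(tab, nrow = length(rowLevels),
                   dimnames = list(rowLevels, colLevels))
countMat
```

The "r4" row is kept and filled with zeros, so the result always has the dimensions you fixed in advance.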

R constrained all combinations [duplicate]

I am trying to randomly sample 7 numbers from 0 to 7 (with replacement), but subject to the constraint that the numbers chosen add up to 7. So for instance, the output 0 1 1 2 3 0 0 is okay, but the output 1 2 3 4 5 6 7 is not. Is there a way to use the sample command with added constraints?
I intend to use the replicate() function with the sample command as an argument, to return a list of N different vectors from the sample command. The way I am currently using the sample command (without any constraints), I need N to be very large in order to get as many possible vectors that sum to exactly 7 as possible. I figure there must be an easier way to do this!
Here is my code for that part:
x <- replicate(100000, sample(0:7, 7, replace=T))
Ideally, I want 10,000 or 100,000 vectors in x to sum to 7, but would need an enormous N value to do this. Thanks for any help.
To make sure you're sampling uniformly, you could just generate all the permutations and limit to those that sum to 7:
library(gtools)
perms <- permutations(8, 7, 0:7, repeats.allowed=T)
perms7 <- perms[rowSums(perms) == 7,]
From nrow(perms7), we see there are only 1716 possible permutations that sum to 7. Now you can uniformly sample from the permutations:
set.seed(144)
my.perms <- perms7[sample(nrow(perms7), 100000, replace=T),]
head(my.perms)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,] 0 0 0 2 5 0 0
# [2,] 1 3 0 1 2 0 0
# [3,] 1 4 1 1 0 0 0
# [4,] 1 0 0 3 0 3 0
# [5,] 0 2 0 0 0 5 0
# [6,] 1 1 2 0 0 2 1
An advantage of this approach is that it's easy to see that we're sampling uniformly at random. Also, it's quite quick -- building perms7 took 0.3 seconds on my computer and building a 1 million-row my.perms took 0.04 seconds. If you need to draw many vectors this will be quite a bit quicker than a recursive approach because you're just using matrix indexing into perms7 instead of generating each vector separately.
Here's a distribution of counts of numbers in the sample:
# 0 1 2 3 4 5 6 7
# 323347 188162 102812 51344 22811 8629 2472 423
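The count of 1716 matches the stars-and-bars formula choose(13, 6), and the same argument gives a way to build the feasible vectors directly in base R without first materialising all 8^7 rows. A sketch (this is a swapped-in construction, not what gtools does internally):

```r
# Place 6 "bars" among 13 slots; the gaps between bars are the 7 counts
bars <- combn(13, 6)
comps <- apply(bars, 2, function(b) diff(c(0, b, 14)) - 1)  # 7 x 1716

stopifnot(ncol(comps) == choose(13, 6), all(colSums(comps) == 7))

# Uniform sample of 10 feasible vectors (one per column)
set.seed(144)
comps[, sample(ncol(comps), 10, replace = TRUE)]
```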
Start with all zeroes, add one to any element, do 7 times:
sumTo = function(){
v = rep(0,7)
for(i in 1:7){
addTo=sample(7)[1]
v[addTo]=v[addTo]+1
}
v
}
Or equivalently, just choose which of the 7 elements you are going to increment in one sample of length 7, then tabulate those, making sure you tabulate up to 7:
sumTo = function(){tabulate(sample(7, 7, replace = TRUE), 7)}
> sumTo()
[1] 2 1 0 0 4 0 0
> sumTo()
[1] 1 3 1 0 1 0 1
> sumTo()
[1] 1 1 0 2 1 0 2
I don't know if this will produce a uniform sample from all possible combinations...
The distribution of individual elements over 100,000 reps is:
> X = replicate(100000,sumTo())
> table(X)
X
0 1 2 3 4 5 6
237709 277926 138810 38465 6427 627 36
Didn't hit a 0,0,0,0,0,0,7 that time!
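On the uniformity question: since each of the 7 increments picks a slot independently, the counts follow a multinomial distribution, which is not uniform over the 1716 feasible vectors. A quick check of two extreme cases:

```r
p_uniform <- 1 / choose(13, 6)   # probability of any one vector under uniform sampling

# Probability of the same vectors under the increment-based sampler
p_extreme <- dmultinom(c(7, 0, 0, 0, 0, 0, 0), prob = rep(1/7, 7))
p_flat    <- dmultinom(rep(1, 7), prob = rep(1/7, 7))

c(uniform = p_uniform, all_in_one_slot = p_extreme, one_in_each_slot = p_flat)
```

Vectors with everything in one slot come out far rarer than under uniform sampling, and the all-ones vector far more common, which fits with no 7 showing up in 100,000 draws.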
This recursive algorithm will output a distribution with a higher probability for large numbers than the other solutions. The idea is to throw a random number y in 0:7 in any of the seven available slots, then repeat with a random number in 0:(7-y), etc:
sample.sum <- function(x = 0:7, n = 7L, s = 7L) {
if (n == 1) return(s)
x <- x[x <= s]
y <- sample(x, 1)
sample(c(y, Recall(x, n - 1L, s - y)))
}
set.seed(123L)
sample.sum()
# [1] 0 4 0 2 0 0 1
Drawing 100,000 vectors took 11 seconds on my machine and here is the distribution I get:
# 0 1 2 3 4 5 6 7
# 441607 98359 50587 33364 25055 20257 16527 14244
There may be an easier and/or more elegant way, but here's a brute-force method using the LSPM:::.nPri function. The link includes the definition for an R-only version of the algorithm, for those interested.
#install.packages("LSPM", repos="http://r-forge.r-project.org")
library(LSPM)
# generate all possible permutations, since there are only ~2.1e6 of them
# (this takes < 40s on my 2.2Ghz laptop)
x <- lapply(seq_len(8^7), nPri, n=8, r=7, replace=TRUE)
# set each permutation that doesn't sum to 7 to NULL
y <- lapply(x, function(p) if(sum(p-1) != 7) NULL else p-1)
# subset all non-NULL permutations
z <- y[which(!sapply(y, is.null))]
Now you can sample from z and be assured that you're getting a permutation that sums to 7.
I find this question intriguing and gave it some extra thought. Another (more general) approach to sample (approximately) uniformly from all feasible solutions, without generating and storing all permutations (which is clearly not possible with many more than 7 numbers), is a simple MCMC implementation in R using sample():
S <- c(0, 1, 1, 2, 3, 0, 0) #initial solution
N <- 100 #number of dependent samples (or burn in period)
series <- numeric(N)
for(i in 1:N){
b <- sample(1:length(S), 2, replace=FALSE) #pick 2 elements at random
opt <- sum(S[-b]) #sum of complementary elements
a <- sample(0:(7-opt), 1) #sample a substitute
S[b[1]] <- a #change elements
S[b[2]] <- 7 - opt - a
}
S #new sample
This is of course really fast for a few samples. The "distribution":
#"distribution" N=100.000: 0 1 2 3 4 5 6 7
# 321729 189647 103206 52129 22287 8038 2532 432
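To collect many draws rather than a single S, the same two-element move can be wrapped in a function. This is only a sketch: mcmc_compositions and its thin argument (number of moves between stored samples) are names and choices made up here, and successive rows are still somewhat dependent.

```r
# Hypothetical wrapper around the pairwise-resampling move
mcmc_compositions <- function(n_samples, total = 7, len = 7, thin = 100) {
  S <- c(total, rep(0, len - 1))           # any feasible starting point
  out <- matrix(0, nrow = n_samples, ncol = len)
  for (k in seq_len(n_samples)) {
    for (i in seq_len(thin)) {
      b <- sample.int(len, 2)              # pick 2 distinct positions
      pair_sum <- S[b[1]] + S[b[2]]        # what the pair must add up to
      a <- sample.int(pair_sum + 1, 1) - 1 # uniform on 0..pair_sum
      S[b] <- c(a, pair_sum - a)
    }
    out[k, ] <- S
  }
  out
}

set.seed(1)
X <- mcmc_compositions(1000)
table(X)
```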
Of course in this case, where it's actually possible to find and store all combinations, and if you want a huge sample from all feasible outcomes, just use partitions::compositions(7, 7), as also suggested by Josh O'Brien in the comments, to avoid calculating all the permutations, when only a small fraction is needed:
perms7 <- partitions::compositions(7, 7)
tabulate(perms7[, sample(ncol(perms7), 100000, TRUE)]+1, 8)
#"distribution" N=100.000: 0 1 2 3 4 5 6 7
# 323075 188787 102328 51511 22754 8697 2413 435

R Turning a list into a matrix when the list contains objects of "different size"

I've seen a couple of questions about turning matrices into lists (not really clear why you would want that) but the reverse operation I've been unable to find.
Basically, following
# ind.dum = data frame with 29 observations and 2635 variables
for (i in 1:ncol(ind.dum))
tmp[[i]]<-which(rollapply(ind.dum[,i],4,identical,c(1,0,0,0),by.column=T))
I got a list of 2635 objects, most of which contain 1 value, but some up to 7. I need to convert this to a matrix with 2635 rows and as many columns as necessary to fit every value in a separate cell (with 0 values for the rest).
I tried all the coerce measures I know (as.data.frame, as.matrix ...) and also the option to define a new matrix with the maximum dimensions but nothing works.
m<-matrix(0,nrow=2635,ncol=7)
tmp_m<-structure(tmp,dim=dim(m))
Error in structure(tmp, dim = dim(m)) : dims [product 18445] do not match the length of object [2635]
I'm sure there's a quick fix for this so I'm hoping someone can help me with it. Btw, my values in the tmp list's objects are numeric, although some are "integer(0)" , i.e. when the pattern c(1,0,0,0) was not found in the columns of the original ind.dum matrix.
Not sure if there is a way to use unlist without losing the information about which values belong originally to the same row...
Desired Output
A matrix or dataframe with 2635 rows and 7 columns and looking like this
12 0 0 0 0 0 0
8 14 0 0 0 0 0
0 0 0 0 0 0 0
1 4 8 12 0 0 0
...
The values basically refer to years in which a specific pattern started. I need to be able to use that information to tie this problem to an earlier problem described before (see this link).
Try this for example:
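# ll is the list to convert (tmp in the question);
# pad every element with zeros up to 7 values, then stack the rows
do.call(rbind, lapply(ll,
function(x) c(x, rep(0, 7 - length(x)))))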
Here's a fast alternative that does what it sounds like you are describing:
First, sample data always helps:
LL <- list(1:3, numeric(0), c(1:3,1), 1:7)
LL
# [[1]]
# [1] 1 2 3
#
# [[2]]
# numeric(0)
#
# [[3]]
# [1] 1 2 3 1
#
# [[4]]
# [1] 1 2 3 4 5 6 7
Second, we'll make use of a little trick referred to as matrix indexing to fill an empty matrix with the values from your list.
## We need to know how many columns are needed for each list item
Ncol <- vapply(LL, length, 1L)
## M is our empty matrix, pre-filled with zeroes
M <- matrix(0, nrow = length(LL), ncol = max(Ncol))
## IJ is the row/column combination where values need to be inserted
IJ <- cbind(rep(seq_along(Ncol), times = Ncol), sequence(Ncol))
## Extract and insert!
M[IJ] <- unlist(LL, use.names = FALSE)
## View the result
M
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,] 1 2 3 0 0 0 0
# [2,] 0 0 0 0 0 0 0
# [3,] 1 2 3 1 0 0 0
# [4,] 1 2 3 4 5 6 7
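For comparison, a shorter (though probably slower) variant pads each element with NA via length<- and then replaces the NAs. A sketch on data shaped like the desired output in the question:

```r
LL <- list(12, c(8, 14), integer(0), c(1, 4, 8, 12))

width <- max(lengths(LL))

# Extending a vector with `length<-` pads it with NA
M <- t(vapply(LL, function(x) {
  x <- as.numeric(x)
  length(x) <- width
  x
}, numeric(width)))

M[is.na(M)] <- 0
M
```

This assumes the longest element has more than one value; with width 1, vapply() returns a plain vector and the transpose would give a single row instead.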
I have a solution.
Not sure if it is good enough or there's any bug.
LL <- list(1:3, numeric(0), c(1:3, 1), 1:7)
with(data.frame(m <- plyr::rbind.fill.matrix(lapply(LL, matrix, nrow = 1))), replace(m, is.na(m), 0))

back and forth to dummy variables in R

So, I've been using R on and off for two years now and been trying to get this whole idea of vectorization. Since I deal a lot with dummy variables from multiple response sets from surveys I thought it would be interesting to learn with this case.
The idea is to go from multiple responses to dummy variables (and back), for example: "Of these 8 different chocolates, which are your favorite ones (choose up to 3) ?"
Sometimes we code this as dummy variables (1 for person likes "Cote d'Or", 0 for person doesn't like it), with 1 variable per option, and some times as categorical (1 for person likes "Cote d'Or", 2 for person likes "Lindt", and so on), with 3 variables for the 3 choices.
So, basically I can end up with a matrix whose rows look like
1,0,0,1,0,0,1,0
Or a matrix with lines like
1,4,7
And the idea, as mentioned, is to go from one to the other. So far I got a loop solution for each case and a vectorized solution for going from dummy to categorical. I would appreciate any further insight into this matter and a vectorized solution for the categorical to dummy step.
DUMMY TO NOT DUMMY
vecOrig<-matrix(0,nrow=18,ncol=8) # From this one
vecDest<-matrix(0,nrow=18,ncol=3) # To this one
# Populating the original matrix.
# I'm pretty sure this could have been added to the definition of the matrix,
# but I kept getting repeated numbers.
# How would you vectorize this?
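vec <- c(1,1,1,0,0,0,0,0) # assumed: vec is not defined in the original; three 1s among the 8 options
for (i in 1:length(vecOrig[,1])){
vecOrig[i,]<-sample(vec)
}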
# Now, how would you vectorize this following step...
for(i in 1:length(vecOrig[,1])){
vecDest[i,]<-grep(1,vecOrig[i,])
}
# Vectorized solution, I had to transpose it for some reason.
vecDest2<-t(apply(vecOrig,1,function(x) grep(1,x)))
NOT DUMMY TO DUMMY
matOrig<-matrix(0,nrow=18,ncol=3) # From this one
matDest<-matrix(0,nrow=18,ncol=8) # To this one.
# We populate the origin matrix. Same thing as the other case.
for (i in 1:length(matOrig[,1])){
matOrig[i,]<-sample(1:8,3,FALSE)
}
# this works, but how to make it vectorized?
for(i in 1:length(matOrig[,1])){
for(j in matOrig[i,]){
matDest[i,j]<-1
}
}
# Not a clue of how to vectorize this one.
# The 'model.matrix' solution doesn't look neat.
Vectorized solutions:
Dummy to not dummy
vecDest <- t(apply(vecOrig == 1, 1, which))
Not dummy to dummy (back to the original)
nCol <- 8
vecOrig <- t(apply(vecDest, 1, replace, x = rep(0, nCol), values = 1))
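The categorical-to-dummy step can also be done without apply(), by indexing the target matrix with a two-column matrix of (row, column) pairs. A sketch that assumes every row has the same number of distinct choices, as in the question's example data:

```r
set.seed(1)
nCol <- 8

# 18 respondents, 3 distinct choices each (categorical coding)
matOrig <- t(replicate(18, sample(nCol, 3)))

# categorical -> dummy: one (row, column) index pair per choice
matDest <- matrix(0, nrow = nrow(matOrig), ncol = nCol)
matDest[cbind(as.vector(row(matOrig)), as.vector(matOrig))] <- 1

# dummy -> categorical, to check the round trip (choices come back sorted)
back <- t(apply(matDest == 1, 1, which))
stopifnot(all(back == t(apply(matOrig, 1, sort))))
```

The replicate() call also answers the side question of how to fill the origin matrix without a loop.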
This might provide some insight for the first part:
#Create example data
set.seed(42)
vecOrig<-matrix(rbinom(20,1,0.2),nrow=5,ncol=4)
     [,1] [,2] [,3] [,4]
[1,]    1    0    0    1
[2,]    1    0    0    1
[3,]    0    0    1    0
[4,]    1    0    0    0
[5,]    0    0    0    0
Note that this does not assume, that the number of ones is equal in each line (e.g., you wrote "choose up to 3").
#use algebra to create position numbers
vecDest <- t(t(vecOrig)*1:ncol(vecOrig))
     [,1] [,2] [,3] [,4]
[1,]    1    0    0    4
[2,]    1    0    0    4
[3,]    0    0    3    0
[4,]    1    0    0    0
[5,]    0    0    0    0
Now, we remove the zeros. Thus, we have to turn the object into a list.
vecDest <- split(t(vecDest), rep(1:nrow(vecDest), each = ncol(vecDest)))
lapply(vecDest,function(x) x[x>0])
$`1`
[1] 1 4
$`2`
[1] 1 4
$`3`
[1] 3
$`4`
[1] 1
$`5`
numeric(0)
