assign cluster labels to data using a cluster assignment matrix - r

Hi, I am using R and have a cluster assignment matrix that comes out of my clustering function (I am applying a clustering algorithm to Gaussian mixture data). I want to create a data matrix of clusters. Here is a toy example of what I want to do.
#simulate data
dat <- c(rnorm(2, 0, 1), rnorm(2, 2, 3), rnorm(3, 0, 1), rnorm(3, 2, 3))
dat
[1] -0.5350681 1.0444655 2.9229136 8.2528266 -0.7561170 -1.0240702 -1.0012780
[8] -0.1322981 7.8525855 2.2278264
# Making up a cluster assignment matrix (actually this one comes
# out of my clustering function)
amat<-matrix(c(1,1,0,0,1,1,1,0,0,0,0,0,1,1,0,0,0,1,1,1), ncol=2, nrow=10)
amat
[,1] [,2]
[1,] 1 0
[2,] 1 0
[3,] 0 1
[4,] 0 1
[5,] 1 0
[6,] 1 0
[7,] 1 0
[8,] 0 1
[9,] 0 1
[10,] 0 1
I want to create a data frame or vector called (say) "clust" that contains cluster labels, as follows, using the assignment matrix given above. Basically it uses the first and second columns of the assignment matrix and assigns label 1 to data coming from the normal distribution N(0,1) and label 2 to data coming from the normal distribution N(2,3). Any help is appreciated. Thanks in advance.
# clust should look like this (I have no idea how to create this using amat and dat)
clust
[1] 1 1 2 2 1 1 1 2 2 2

The second column of the assignment matrix is already binary, so we can simply add 1L to it:
clust <- amat[,2] + 1L
clust
[1] 1 1 2 2 1 1 1 2 2 2
(The L suffix makes the value an integer rather than a double.)
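If you want the labels alongside the data as a data frame (the question asks for either), a minimal sketch:
data.frame(dat, clust)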

Isn't this essentially
1 * column1 + 2 * column2 + 3 * column3, and so on?
That should be straightforward to write as a matrix multiplication with [1, 2, 3, 4, ...] and a sum operation.
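That generalizes to any number of clusters; a small sketch using the toy amat above:
# column j of the assignment matrix contributes label j
clust <- as.vector(amat %*% seq_len(ncol(amat)))
clust
[1] 1 1 2 2 1 1 1 2 2 2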

Related

R Turning a list into a matrix when the list contains objects of "different size"

I've seen a couple of questions about turning matrices into lists (not really clear why you would want that), but I've been unable to find the reverse operation.
Basically, following
library(zoo)  # rollapply comes from zoo
# ind.dum = data frame with 29 observations and 2635 variables
tmp <- vector("list", ncol(ind.dum))
for (i in 1:ncol(ind.dum))
  tmp[[i]] <- which(rollapply(ind.dum[, i], 4, identical, c(1, 0, 0, 0), by.column = TRUE))
I got a list of 2635 objects, most of which contain one value, but some up to seven. I need to convert this to a matrix with 2635 rows and as many columns as necessary to fit every value in a separate cell (with 0 for the rest).
I tried all the coercion functions I know (as.data.frame, as.matrix, ...) and also tried defining a new matrix with the maximum dimensions, but nothing works.
m <- matrix(0, nrow = 2635, ncol = 7)
tmp_m <- structure(tmp, dim = dim(m))
Error in structure(tmp, dim = dim(m)) :
  dims [product 18445] do not match the length of object [2635]
I'm sure there's a quick fix for this, so I'm hoping someone can help me with it. Btw, the values in the tmp list's objects are numeric, although some are integer(0), i.e. where the pattern c(1,0,0,0) was not found in the corresponding column of the original ind.dum matrix.
Not sure if there is a way to use unlist without losing the information about which values originally belong to the same row...
Desired Output
A matrix or dataframe with 2635 rows and 7 columns and looking like this
12 0 0 0 0 0 0
8 14 0 0 0 0 0
0 0 0 0 0 0 0
1 4 8 12 0 0 0
...
The values basically refer to years in which a specific pattern started. I need to be able to use that information to tie this problem to an earlier problem described before (see this link).
Try this, for example (pad each element with zeroes up to the maximum length before rbind-ing; this also handles the integer(0) elements):
do.call(rbind, lapply(tmp, function(x) c(x, rep(0, 7 - length(x)))))
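For instance, with a toy list shaped like the question's data (the padding width 7 is taken from the desired output):
tmp <- list(12, c(8, 14), integer(0), c(1, 4, 8, 12))
do.call(rbind, lapply(tmp, function(x) c(x, rep(0, 7 - length(x)))))
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]   12    0    0    0    0    0    0
[2,]    8   14    0    0    0    0    0
[3,]    0    0    0    0    0    0    0
[4,]    1    4    8   12    0    0    0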
Here's a fast alternative that does what it sounds like you are describing:
First, sample data always helps:
LL <- list(1:3, numeric(0), c(1:3,1), 1:7)
LL
# [[1]]
# [1] 1 2 3
#
# [[2]]
# numeric(0)
#
# [[3]]
# [1] 1 2 3 1
#
# [[4]]
# [1] 1 2 3 4 5 6 7
Second, we'll make use of a little trick referred to as matrix indexing to fill an empty matrix with the values from your list.
## We need to know how many columns are needed for each list item
Ncol <- vapply(LL, length, 1L)
## M is our empty matrix, pre-filled with zeroes
M <- matrix(0, nrow = length(LL), ncol = max(Ncol))
## IJ is the row/column combination where values need to be inserted
IJ <- cbind(rep(seq_along(Ncol), times = Ncol), sequence(Ncol))
## Extract and insert!
M[IJ] <- unlist(LL, use.names = FALSE)
## View the result
M
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,] 1 2 3 0 0 0 0
# [2,] 0 0 0 0 0 0 0
# [3,] 1 2 3 1 0 0 0
# [4,] 1 2 3 4 5 6 7
I have a solution, though I'm not sure whether it's good enough or bug-free.
LL <- list(1:3, numeric(0), c(1:3, 1), 1:7)
m <- plyr::rbind.fill.matrix(lapply(LL, matrix, nrow = 1))
replace(m, is.na(m), 0)

Creating a Matrix From a Vector in R

I have data with two columns, one containing numerical values and one containing names. I'm a novice to R, but essentially I want to take this and create a matrix in which the values add together: for example, where A has a value of 1 and B has a value of 1, the cell at the intersection of A and B should contain their sum, 2.
I've tried to use a for loop, but I'm having trouble with the arguments to put within the loop. Any help would be greatly appreciated, and I'd be glad to clarify if it doesn't make sense.
Essentially what I want is to take this:
A 1
B 0
C 0
D 1
And turn it into this (the diagonal is ignored, shown here as "."):
  A B C D
A . 1 1 2
B 1 . 0 1
C 1 0 . 1
D 2 1 1 .
Thanks!
R > x <- c(1,0,0,1)
R > outer(x, x, "+")
[,1] [,2] [,3] [,4]
[1,] 2 1 1 2
[2,] 1 0 0 1
[3,] 1 0 0 1
[4,] 2 1 1 2
The next thing is to ignore the diagonal and attach the labels. Updated by Vincent: name the vector, and outer will carry the names over to the dimnames of the result:
names(x) <- c("A","B","C","D")
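Putting the pieces together (a small sketch):
x <- c(A = 1, B = 0, C = 0, D = 1)
m <- outer(x, x, "+")
diag(m) <- NA  # ignore the diagonal
m
   A  B  C  D
A NA  1  1  2
B  1 NA  0  1
C  1  0 NA  1
D  2  1  1 NA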

How to efficiently retrieve top K-similar vectors by cosine similarity using R?

I'm working on a high-dimensional problem (~4k terms) and would like to retrieve the top k similar items (by cosine similarity), and I can't afford to do a pair-wise calculation.
My training set is a 6 million x 4k matrix and I would like to make predictions for a 600k x 4k matrix.
What is the most efficient way to retrieve the k most similar items for each item in my 600k x 4k matrix?
Ideally, I would like to get a matrix which is 600k x 10 (i.e., the top 10 similar items for each of the 600k items).
PS: I've searched the SO website and found that almost all "cosine similarity in R" questions refer to cosine_sim(vector1, vector2). But this question is about cosine_sim(matrix1, matrix2).
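For reference, the cosine similarity between two row vectors $x$ and $y$ is

$$\cos(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert} = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}},$$

which is what the crossprod calls in the naive code below compute, one row pair at a time.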
Update
The following code uses a naive method to find the cosine similarity between each row in the testset and every row in the training set.
set.seed(123)
train<-matrix(round(runif(30),0),nrow=6,ncol=5)
set.seed(987)
test<-matrix(round(runif(20),0),nrow=4,ncol=5)
train
[1,] 0 1 1 0 1
[2,] 1 1 1 1 1
[3,] 0 1 0 1 1
[4,] 1 0 1 1 1
[5,] 1 1 0 1 0
[6,] 0 0 0 1 0
test
[1,] 0 1 1 0 0
[2,] 1 0 1 0 1
[3,] 1 0 0 0 0
[4,] 1 0 0 1 1
coSim <- function(mat1, mat2, topK){
  require(plyr)
  # mat1: the training set; mat2: the test set. We find the cosine similarity
  # between each row in the test set and every row in the training set.
  # topK: user input. For each row in the test set we return the indices of the
  # 'topK' most similar rows from the training set.
  # Set up an empty result matrix; its rows cover the cartesian product of the
  # rows of mat1 and mat2.
  result <- matrix(NA, nrow = nrow(mat1) * nrow(mat2), ncol = 3)
  k <- 1
  for(i in 1:nrow(mat2)){
    for(j in 1:nrow(mat1)){
      result[k, 1] <- i
      result[k, 2] <- j
      result[k, 3] <- crossprod(mat1[j, ], mat2[i, ]) /
        sqrt(crossprod(mat1[j, ]) * crossprod(mat2[i, ]))
      k <- k + 1
    }
  }
  # Sort by cosine similarity within each test row; convert to a data frame so
  # we can keep the topK rows per group.
  result <- as.data.frame(result)
  colnames(result) <- c("testRowId", "trainRowId", "CosineSimilarity")
  result <- ddply(result, "testRowId",
                  function(x) head(x[order(x$CosineSimilarity, decreasing = TRUE), ], topK))
  resultMat <- matrix(result$trainRowId, nrow = nrow(mat2), ncol = topK, byrow = TRUE)
  list(similarity = result, index = resultMat)
}
system.time(cosineSim<-coSim(train, test, topK=2)) #0.12 secs
cosineSim
$similarity
testRowId trainRowId CosineSimilarity
1 1 1 0.8164966
2 1 2 0.6324555
3 2 4 0.8660254
4 2 2 0.7745967
5 3 5 0.5773503
6 3 4 0.5000000
7 4 4 0.8660254
8 4 2 0.7745967
$index
[,1] [,2]
[1,] 1 2
[2,] 4 2
[3,] 5 4
[4,] 4 2
set.seed(123)
train<-matrix(round(runif(1000000),0),nrow=5000,ncol=200)
set.seed(987)
test<-matrix(round(runif(400000),0),nrow=2000,ncol=200)
system.time(cosineSim<-coSim(train, test, topK=50)) #380secs
When I run the same function with a 5000 x 200 training matrix and a 2000 x 200 test matrix, it takes over 380 seconds.
Ideally, I would like to see some approaches where I do not have to calculate the similarity between each and every pair of rows. If that is not possible, some pointers on how to vectorise the above code would be helpful.
No need to compute the similarity for every row. You can use this instead:
coSim2 <- function(mat1, mat2, topK){
  # Similarity computation:
  xy <- tcrossprod(mat1, mat2)
  xx <- rowSums(mat1^2)
  yy <- rowSums(mat2^2)
  result <- xy / sqrt(outer(xx, yy))
  # Top similar rows from train (per row in test):
  top <- apply(result, 2, order, decreasing = TRUE)[1:topK, ]
  result_df <- data.frame(testRowId = c(col(top)), trainRowId = c(top))
  result_df$CosineSimilarity <- result[as.matrix(result_df[, 2:1])]
  list(similarity = result_df, index = t(top))
}
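The key step is tcrossprod(mat1, mat2), which computes every train-by-test dot product in a single matrix multiplication, while outer(xx, yy) supplies the matching products of squared norms; dividing the one by the square root of the other yields the full cosine-similarity matrix without any R-level loop.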
Test data (I've reduced your train matrix)
set.seed(123)
train<-matrix(round(runif(100000),0),nrow=500,ncol=200)
set.seed(987)
test<-matrix(round(runif(400000),0),nrow=2000,ncol=200)
Result:
> system.time(cosineSim <- coSim(train, test, topK=50))
   user  system elapsed
  41.71    1.59   43.72
> system.time(cosineSim2 <- coSim2(train, test, topK=50))
   user  system elapsed
   0.46    0.02    0.49
Using your full 5000 x 200 train matrix, coSim2 runs in 7.8 sec.
Also note:
> any(cosineSim$similarity != cosineSim2$similarity)
[1] FALSE
> any(cosineSim$index != cosineSim2$index)
[1] FALSE
You can't use identical because my function returns integers instead of doubles for row IDs.

back and forth to dummy variables in R

So, I've been using R on and off for two years now and have been trying to get the whole idea of vectorization. Since I deal a lot with dummy variables from multiple-response sets in surveys, I thought this would be an interesting case to learn with.
The idea is to go from multiple responses to dummy variables (and back), for example: "Of these 8 different chocolates, which are your favorite ones (choose up to 3)?"
Sometimes we code this as dummy variables (1 if the person likes "Cote d'Or", 0 if not), with one variable per option, and sometimes as categorical (1 if the person likes "Cote d'Or", 2 for "Lindt", and so on), with 3 variables for the 3 choices.
So basically I can end up with a matrix whose rows look like
1,0,0,1,0,0,1,0
Or a matrix with rows like
1,4,7
And the idea, as mentioned, is to go from one to the other. So far I have a loop solution for each case and a vectorized solution for going from dummy to categorical. I would appreciate any further insight into this matter, and a vectorized solution for the categorical-to-dummy step.
DUMMY TO NOT DUMMY
vecOrig <- matrix(0, nrow = 18, ncol = 8) # From this one
vecDest <- matrix(0, nrow = 18, ncol = 3) # To this one
# vec was not defined in the original post; presumably it is 3 ones among 8 slots:
vec <- c(rep(1, 3), rep(0, 5))
# Populating the original matrix.
# I'm pretty sure this could have been added to the definition of the matrix,
# but I kept getting repeated numbers.
# How would you vectorize this?
for (i in 1:nrow(vecOrig)){
  vecOrig[i, ] <- sample(vec)
}
# Now, how would you vectorize this following step...
for(i in 1:nrow(vecOrig)){
  vecDest[i, ] <- grep(1, vecOrig[i, ])
}
# Vectorized solution; I had to transpose it for some reason.
vecDest2 <- t(apply(vecOrig, 1, function(x) grep(1, x)))
NOT DUMMY TO DUMMY
matOrig <- matrix(0, nrow = 18, ncol = 3) # From this one
matDest <- matrix(0, nrow = 18, ncol = 8) # To this one.
# We populate the origin matrix. Same thing as the other case.
for (i in 1:nrow(matOrig)){
  matOrig[i, ] <- sample(1:8, 3, FALSE)
}
# This works, but how to make it vectorized?
for(i in 1:nrow(matOrig)){
  for(j in matOrig[i, ]){
    matDest[i, j] <- 1
  }
}
# Not a clue of how to vectorize this one.
# The 'model.matrix' solution doesn't look neat.
Vectorized solutions:
Dummy to not dummy
vecDest <- t(apply(vecOrig == 1, 1, which))
Not dummy to dummy (back to the original)
nCol <- 8
vecOrig <- t(apply(vecDest, 1, replace, x = rep(0, nCol), values = 1))
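A sketch of an alternative for the not-dummy-to-dummy direction, reusing the matrix-indexing trick from the list-to-matrix answer above (the width 8 comes from the question's 8 options):
matDest <- matrix(0, nrow = nrow(matOrig), ncol = 8)
matDest[cbind(rep(seq_len(nrow(matOrig)), ncol(matOrig)), c(matOrig))] <- 1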
This might provide some insight for the first part:
#Create example data
set.seed(42)
vecOrig<-matrix(rbinom(20,1,0.2),nrow=5,ncol=4)
[,1] [,2] [,3] [,4]
[1,] 1 0 0 1
[2,] 1 0 0 1
[3,] 0 0 1 0
[4,] 1 0 0 0
[5,] 0 0 0 0
Note that this does not assume that the number of ones is equal in each row (e.g., you wrote "choose up to 3").
#use algebra to create position numbers
vecDest <- t(t(vecOrig)*1:ncol(vecOrig))
[,1] [,2] [,3] [,4]
[1,] 1 0 0 4
[2,] 1 0 0 4
[3,] 0 0 3 0
[4,] 1 0 0 0
[5,] 0 0 0 0
Now we remove the zeros. To do that, we have to turn the object into a list.
vecDest <- split(t(vecDest), rep(1:nrow(vecDest), each = ncol(vecDest)))
lapply(vecDest,function(x) x[x>0])
$`1`
[1] 1 4
$`2`
[1] 1 4
$`3`
[1] 3
$`4`
[1] 1
$`5`
numeric(0)

random sampling - matrix

How can I take a sample of n random points from a matrix populated with 1's and 0's?
a=rep(0:1,5)
b=rep(0,10)
c=rep(1,10)
dataset=matrix(cbind(a,b,c),nrow=10,ncol=3)
dataset
[,1] [,2] [,3]
[1,] 0 0 1
[2,] 1 0 1
[3,] 0 0 1
[4,] 1 0 1
[5,] 0 0 1
[6,] 1 0 1
[7,] 0 0 1
[8,] 1 0 1
[9,] 0 0 1
[10,] 1 0 1
I want to be sure that the positions (row, col) from which I take the N samples are random.
I know sample {base}, but it doesn't seem to let me do that; the other methods I know are spatial methods that would force me to add x,y coordinates, convert to a spatial object, and then back to a normal matrix.
More information
By random I also mean spread out inside the "matrix space": e.g., if I take a sample of 4 points, I don't want 4 neighbouring points as a result; I want them spread over the "matrix space".
Knowing the position (row, col) in the matrix from which I took the random points would also be important.
There is a very easy way to sample a matrix that works if you understand that R represents a matrix internally as a vector.
This means you can use sample directly on your matrix. For example, let's assume you want to sample 10 points with replacement:
n <- 10
replace <- TRUE
Now just use sample on your matrix:
set.seed(1)
sample(dataset, n, replace=replace)
[1] 1 0 0 1 0 1 1 0 0 1
To demonstrate how this works, let's decompose it into two steps. Step 1 is to generate an index of sampling positions, and step 2 is to find those positions in your matrix:
set.seed(1)
mysample <- sample(length(dataset), n, replace=replace)
mysample
[1] 8 12 18 28 7 27 29 20 19 2
dataset[mysample]
[1] 1 0 0 1 0 1 1 0 0 1
And, hey presto, the results of the two methods are identical.
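If you also need the (row, col) positions of the sampled cells, base R's arrayInd converts the linear indices back:
pos <- arrayInd(mysample, dim(dataset))
colnames(pos) <- c("row", "col")
pos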
Sample seems the best bet for you. To get 1000 random positions you can do something like:
rows = sample(1:nrow(dataset), 1000, replace = TRUE)
columns = sample(1:ncol(dataset), 1000, replace = TRUE)
I think this gives what you want, but of course I could be mistaken.
Extracting the items from the matrix can be done like this:
random_sample <- mapply(function(row, col) dataset[row, col],
                        row = rows, col = columns)
Sampling strategies
In the comments you mention that your sample needs to be spread out. A random sample gives no guarantee that there will be no clusters, precisely because of its random nature. There are several other sampling schemes that might be interesting to explore:
Regular sampling: skip the randomness and just sample on a regular grid. This covers the matrix space evenly, but there is no randomness.
Stratified random sampling: divide the matrix space into regular subsets, then sample randomly within each subset. This is a mix between regular and random sampling; see the sketch after this list.
To check whether your random sampling produces good results, I'd repeat it a few times and compare the results (as I assume the sampling will be input for another analysis?).
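A rough sketch of stratified random sampling over the matrix space: lay blocks over the rows and draw one random cell from each block (stratified_sample and the block count are made up for illustration, not from the answers above):
stratified_sample <- function(mat, n_blocks) {
  blocks <- split(seq_len(nrow(mat)),
                  cut(seq_len(nrow(mat)), n_blocks, labels = FALSE))
  t(sapply(blocks, function(rows) {
    r <- if (length(rows) > 1) sample(rows, 1) else rows  # avoid the sample(n, 1) gotcha
    c(row = r, col = sample(ncol(mat), 1))
  }))
}
set.seed(42)
stratified_sample(dataset, n_blocks = 4)  # 4 cells, spread across row blocks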
