Back and forth to dummy variables in R

So, I've been using R on and off for two years now and have been trying to get the whole idea of vectorization. Since I deal a lot with dummy variables from multiple response sets in surveys, I thought this would be an interesting case to learn with.
The idea is to go from multiple responses to dummy variables (and back), for example: "Of these 8 different chocolates, which are your favorite ones (choose up to 3)?"
Sometimes we code this as dummy variables (1 for person likes "Cote d'Or", 0 for person doesn't like it), with 1 variable per option, and sometimes as categorical (1 for person likes "Cote d'Or", 2 for person likes "Lindt", and so on), with 3 variables for the 3 choices.
So, basically, I can end up with a matrix whose rows look like
1,0,0,1,0,0,1,0
Or a matrix with rows like
1,4,7
And the idea, as mentioned, is to go from one to the other. So far I have a loop solution for each case and a vectorized solution for going from dummy to categorical. I would appreciate any further insight into this matter, and a vectorized solution for the categorical to dummy step.
DUMMY TO NOT DUMMY
vecOrig<-matrix(0,nrow=18,ncol=8) # From this one
vecDest<-matrix(0,nrow=18,ncol=3) # To this one
# Populating the original matrix.
# I'm pretty sure this could have been added to the definition of the matrix,
# but I kept getting repeated numbers.
# How would you vectorize this?
vec<-c(1,1,1,0,0,0,0,0) # assumed: 3 picks out of 8 options ('vec' was not defined in the original post)
for (i in 1:nrow(vecOrig)){
vecOrig[i,]<-sample(vec)
}
# Now, how would you vectorize this following step...
for(i in 1:length(vecOrig[,1])){
vecDest[i,]<-grep(1,vecOrig[i,])
}
# Vectorized solution, I had to transpose it for some reason.
vecDest2<-t(apply(vecOrig,1,function(x) grep(1,x)))
NOT DUMMY TO DUMMY
matOrig<-matrix(0,nrow=18,ncol=3) # From this one
matDest<-matrix(0,nrow=18,ncol=8) # To this one.
# We populate the origin matrix. Same thing as the other case.
for (i in 1:length(matOrig[,1])){
matOrig[i,]<-sample(1:8,3,FALSE)
}
# this works, but how to make it vectorized?
for(i in 1:length(matOrig[,1])){
for(j in matOrig[i,]){
matDest[i,j]<-1
}
}
# Not a clue of how to vectorize this one.
# The 'model.matrix' solution doesn't look neat.

Vectorized solutions:
Dummy to not dummy
vecDest <- t(apply(vecOrig == 1, 1, which))
Not dummy to dummy (back to the original)
nCol <- 8
vecOrig <- t(apply(vecDest, 1, replace, x = rep(0, nCol), values = 1))
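For the categorical to dummy step, matrix indexing is another option that avoids apply altogether. The following is only a sketch under the question's assumptions (18 respondents, 8 options, exactly 3 picks each); the names nResp, nOpt and nPick are made up for illustration. It also shows one loop-free way to generate the sample data with replicate.

```r
set.seed(1)
nResp <- 18  # respondents (rows)
nOpt  <- 8   # available options (dummy columns)
nPick <- 3   # choices per respondent (categorical columns)

# Sample data without a loop: one draw of nPick distinct options per respondent
matOrig <- t(replicate(nResp, sample(nOpt, nPick)))

# Categorical -> dummy: index the target matrix with (row, column) pairs
matDest <- matrix(0, nrow = nResp, ncol = nOpt)
matDest[cbind(as.vector(row(matOrig)), as.vector(matOrig))] <- 1

# Dummy -> categorical: positions of the ones in each row (in ascending order)
back <- t(apply(matDest == 1, 1, which))
```

Note that the round trip returns the choices sorted by option number, so the original order of the picks within a row is lost.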

This might provide some insight for the first part:
#Create example data
set.seed(42)
vecOrig<-matrix(rbinom(20,1,0.2),nrow=5,ncol=4)
[,1] [,2] [,3] [,4]
[1,] 1 0 0 1
[2,] 1 0 0 1
[3,] 0 0 1 0
[4,] 1 0 0 0
[5,] 0 0 0 0
Note that this does not assume that the number of ones is the same in each row (you wrote "choose up to 3", so it may vary).
#use algebra to create position numbers
vecDest <- t(t(vecOrig)*1:ncol(vecOrig))
[,1] [,2] [,3] [,4]
[1,] 1 0 0 4
[2,] 1 0 0 4
[3,] 0 0 3 0
[4,] 1 0 0 0
[5,] 0 0 0 0
Now, we remove the zeros. Thus, we have to turn the object into a list.
vecDest <- split(t(vecDest), rep(1:nrow(vecDest), each = ncol(vecDest)))
lapply(vecDest,function(x) x[x>0])
$`1`
[1] 1 4
$`2`
[1] 1 4
$`3`
[1] 3
$`4`
[1] 1
$`5`
numeric(0)
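The same ragged result can probably be obtained more directly with which(..., arr.ind = TRUE), which reports the row and column of every 1. This is a sketch rather than a drop-in replacement; the matrix below is typed in by hand to match the example above, and the factor levels are there so that rows without any 1 are kept as empty entries.

```r
# Same 5 x 4 example matrix as above, entered column by column
vecOrig <- matrix(c(1, 1, 0, 1, 0,
                    0, 0, 0, 0, 0,
                    0, 0, 1, 0, 0,
                    1, 1, 0, 0, 0), nrow = 5, ncol = 4)

# Row/column positions of every 1
idx <- which(vecOrig == 1, arr.ind = TRUE)

# Group the column numbers by row; explicit levels keep rows with no 1 at all
res <- split(unname(idx[, "col"]),
             factor(idx[, "row"], levels = seq_len(nrow(vecOrig))))
```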

Related

assign cluster labels to data using a cluster assignment matrix

Hi, I am using R and have a cluster assignment matrix that comes out of my clustering function (I am applying a clustering algorithm to Gaussian mixture data). I want to create a data matrix of clusters. Here is a toy example of what I want to do.
#simulate data
dat=Z<-c(rnorm(2,0,1),rnorm(2,2,3),rnorm(3,0,1),rnorm(3,2,3))
dat
[1] -0.5350681 1.0444655 2.9229136 8.2528266 -0.7561170 -1.0240702 -1.0012780
[8] -0.1322981 7.8525855 2.2278264
# Making up a cluster assignment matrix (actually this one comes out of my
# clustering function)
amat<-matrix(c(1,1,0,0,1,1,1,0,0,0,0,0,1,1,0,0,0,1,1,1), ncol=2, nrow=10)
amat
[,1] [,2]
[1,] 1 0
[2,] 1 0
[3,] 0 1
[4,] 0 1
[5,] 1 0
[6,] 1 0
[7,] 1 0
[8,] 0 1
[9,] 0 1
[10,] 0 1
I want to create a data frame or vector called (say) "clust" that contains the cluster labels, built from the assignment matrix given above. Basically, it uses the first and second columns of the assignment matrix: it assigns label 1 to data coming from the normal distribution N(0,1) and label 2 to data coming from the normal distribution N(2,3). Any help is appreciated. Thanks in advance.
# clust should look like this (I have no idea how to create this using amat and dat)
clust
[1] 1 1 2 2 1 1 1 2 2 2
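The assignment matrix is already binary, with exactly one 1 per row, so we can simply add 1L to the second column:
clust <- amat[,2] + 1L
[1] 1 1 2 2 1 1 1 2 2 2
(The suffix L makes the literal 1 an integer. Because amat stores doubles here, the result is still a numeric vector; wrap it in as.integer() if you need integers.)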
Isn't this essentially
1 * column1 + 2 * column2 + 3 * column3 and so on?
That should be straightforward to write as a matrix multiplication with [1,2,3,4,...], which performs the multiplication and the sum in one step.
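The matrix multiplication idea can be sketched as follows, using the amat from the question. It assumes every row of the assignment matrix contains exactly one 1. max.col() is shown as an alternative that picks the column holding the largest value in each row.

```r
amat <- matrix(c(1,1,0,0,1,1,1,0,0,0,
                 0,0,1,1,0,0,0,1,1,1), ncol = 2, nrow = 10)

# Weighted sum of the columns: 1 * column1 + 2 * column2 + ...
clust_mult <- as.vector(amat %*% seq_len(ncol(amat)))

# Alternative: index of the column with the maximum in each row
clust_max <- max.col(amat, ties.method = "first")
```

Both generalize to more than two clusters without any change.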

Fastest way to populate a matrix using row/column indices stored in vectors

I'm trying to do something that seems relatively straightforward to do with something apply-esque, but I can only get it to work using a for loop.
The general idea is I have two vectors, with one vector corresponding to a row in the matrix and another vector corresponding to the column, both the same length. I start with a 0 matrix, and increment [row,column] based on the pair of values in the two vectors. For example:
vectorCols <- c(1,2,3,1,3)
vectorRows <- c(2,1,2,3,2)
countMat <- matrix(rep(0,9),ncol=3)
And at the end, countMat is:
[,1] [,2] [,3]
[1,] 0 1 0
[2,] 1 0 2
[3,] 1 0 0
This is pretty manageable with a for loop:
for (i in 1:length(vectorCols)){
countMat[vectorRows[i],vectorCols[i]] <- countMat[vectorRows[i],vectorCols[i]] + 1
}
But I can't help thinking there is a better way to do this in R. I've tried using the apply family of functions, but these don't cooperate well when you want to assign something. I know I could use mapply and build each element of countMat one value at a time, but this seems inefficient: vectorRows and vectorCols are very long, and it seems wasteful to traverse them in full once for each cell in countMat. But other than a loop and mapply, I can't think of how to do this. I've considered using assign with one of the apply family, but there's a caveat: my matrix actually has names for the columns and rows, with the names stored in vectorCols and vectorRows, and it seems assign doesn't play well with something like countMat["rowName"]["columnName"] (not to mention that apply will still want to return a value for each step in the iteration).
Any suggestions? I'd also be curious if there is an ideal way to do this if I don't have names for the vector columns and rows. If that's the case then maybe I can convert vectorCols and vectorRows to numbers, then build the matrix, then rename everything.
Thanks all.
Here are some solutions. No packages are needed.
1) table
table(vectorRows, vectorCols)
giving:
vectorCols
vectorRows 1 2 3
1 0 1 0
2 1 0 2
3 1 0 0
Note that if there is any row or column with no entries then it will not appear.
2) aggregate
ag <- aggregate( Freq ~ ., data.frame(Freq = 1, vectorRows, vectorCols), sum)
countMat[as.matrix(ag[-3])] <- ag[[3]]
giving:
> countMat
[,1] [,2] [,3]
[1,] 0 1 0
[2,] 1 0 2
[3,] 1 0 0
3) xtabs
xtabs(~ vectorRows + vectorCols)
giving:
vectorCols
vectorRows 1 2 3
1 0 1 0
2 1 0 2
3 1 0 0
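Regarding the note under 1) that rows or columns with no entries do not appear: one possible workaround is to convert the index vectors to factors with explicit levels before tabulating. This is a sketch that assumes a 4 x 4 target matrix (n is a made-up name), so that row 4 and column 4 have no entries on purpose.

```r
vectorCols <- c(1, 2, 3, 1, 3)
vectorRows <- c(2, 1, 2, 3, 2)
n <- 4  # assumed size of the target matrix; index 4 never occurs in the data

# Explicit factor levels force table() to keep the empty row and column
tab <- table(factor(vectorRows, levels = seq_len(n)),
             factor(vectorCols, levels = seq_len(n)))

# Drop the table class and keep a plain n x n count matrix
countMat <- matrix(as.vector(tab), nrow = n)
```

If the rows and columns are identified by names rather than numbers, the same approach should work with the full set of names passed as the levels.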

How to sort a matrix/data.frame by all columns

I have a matrix, e.g.:
a = rep(0:1, each=4)
b = rep(rep(0:1, each=2), 2)
c = rep(0:1, times=4)
mat = cbind(c,b,a)
I need to sort the rows of this matrix by all of its columns. I know how to do this by naming specific columns (i.e. a limited number of columns).
mat[order(mat[,"c"],mat[,"b"],mat[,"a"]),]
c b a
[1,] 0 0 0
[2,] 0 0 1
[3,] 0 1 0
[4,] 0 1 1
[5,] 1 0 0
[6,] 1 0 1
[7,] 1 1 0
[8,] 1 1 1
However, I need a generic way of doing this without calling any column names, because I could have any number of columns. How can I sort by a large number of columns?
Here's a concise solution:
mat[do.call(order,as.data.frame(mat)),];
## c b a
## [1,] 0 0 0
## [2,] 0 0 1
## [3,] 0 1 0
## [4,] 0 1 1
## [5,] 1 0 0
## [6,] 1 0 1
## [7,] 1 1 0
## [8,] 1 1 1
The call to as.data.frame() converts the matrix to a data.frame in the intuitive way, i.e. each matrix column becomes a list component in the new data.frame. From that, you can effectively pass each matrix column to a single invocation of order() by passing the listified form of the matrix as the second argument of do.call().
This will work for any number of columns.
It's not a dumb question. The reason that mat[order(as.data.frame(mat)),] does not work is because order() does not order data.frames by row.
Instead of returning a row order for the data.frame based on ordering the column vectors from left to right (which is what my solution does), it basically flattens the data.frame to a single big vector and orders that.
So, in fact, order(as.data.frame(mat)) is equivalent to order(mat), as a matrix is treated as a flat vector as well.
For your particular data, this returns 24 indexes, which could theoretically be used to index (as a vector) the original matrix mat, but since in the expression mat[order(as.data.frame(mat)),] you're trying to use them to index just the row dimension of mat, some of the indexes are past the highest row index, so you get a "subscript out of bounds" error.
See ?do.call.
I don't think I can explain it better than the help page; take a look at the examples, play with them until you get how it works. Basically, you need to call it when the arguments you want to pass to a single invocation of a function are trapped inside a list.
You can't pass the list itself (because then you're not passing the intended arguments, you're passing a list containing the intended arguments), so there must be a primitive function that "unwraps" the arguments from the list for the function call.
This is a common primitive in programming languages where functions are first-class objects, notably (besides R's do.call()) JavaScript's apply(), Python's apply() (deprecated in Python 2 and removed in Python 3), and vim's call().
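To make the "unwrapping" concrete, here is a small sketch using the matrix from the question. It checks that do.call() with a list of columns produces the same ordering as spelling out each column by hand.

```r
a <- rep(0:1, each = 4)
b <- rep(rep(0:1, each = 2), 2)
c <- rep(0:1, times = 4)
mat <- cbind(c, b, a)

# Each column becomes one element of a list ...
cols <- as.data.frame(mat)

# ... and do.call() passes those elements as separate arguments to order()
ord_list   <- do.call(order, cols)
ord_manual <- order(mat[, "c"], mat[, "b"], mat[, "a"])

sorted <- mat[ord_list, ]
```

One caveat: the list elements are passed as named arguments, so a column that happens to be called decreasing or method would be taken as an option of order(). Wrapping the list in unname() avoids that.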

R Turning a list into a matrix when the list contains objects of "different size"

I've seen a couple of questions about turning matrices into lists (not really clear why you would want that) but the reverse operation I've been unable to find.
Basically, following
# ind.dum = data frame with 29 observations and 2635 variables
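library(zoo) # assumed: rollapply() is taken from the zoo package
tmp <- vector("list", ncol(ind.dum)) # pre-allocate the result list
for (i in 1:ncol(ind.dum))
tmp[[i]]<-which(rollapply(ind.dum[,i],4,identical,c(1,0,0,0),by.column=TRUE))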
I got a list of 2635 objects, most of which contain 1 value, but some up to 7. I need to convert this to a matrix with 2635 rows and as many columns as necessary to fit every value in a separate cell (with 0 values for the rest).
I tried all the coerce measures I know (as.data.frame, as.matrix ...) and also the option to define a new matrix with the maximum dimensions but nothing works.
m<-matrix(0,nrow=2635,ncol=7)
tmp_m<-structure(tmp,dim=dim(m))
Error in structure(tmp,dim=dim(m)) : dims [product 18445] do not match the length of object [2635]
I'm sure there's a quick fix for this so I'm hoping someone can help me with it. Btw, my values in the tmp list's objects are numeric, although some are "integer(0)" , i.e. when the pattern c(1,0,0,0) was not found in the columns of the original ind.dum matrix.
Not sure if there is a way to use unlist without losing the information about which values belong originally to the same row...
Desired Output
A matrix or dataframe with 2635 rows and 7 columns and looking like this
12 0 0 0 0 0 0
8 14 0 0 0 0 0
0 0 0 0 0 0 0
1 4 8 12 0 0 0
...
The values basically refer to years in which a specific pattern started. I need to be able to use that information to tie this problem to an earlier problem described before (see this link).
Try this for example:
# ll is the list from the question (tmp); pad every element with zeros up to length 7
do.call(rbind,lapply(ll,
function(x)
c(x,rep(0,7-length(x)))))
Here's a fast alternative that does what it sounds like you are describing:
First, sample data always helps:
LL <- list(1:3, numeric(0), c(1:3,1), 1:7)
LL
# [[1]]
# [1] 1 2 3
#
# [[2]]
# numeric(0)
#
# [[3]]
# [1] 1 2 3 1
#
# [[4]]
# [1] 1 2 3 4 5 6 7
Second, we'll make use of a little trick referred to as matrix indexing to fill an empty matrix with the values from your list.
## We need to know how many columns are needed for each list item
Ncol <- vapply(LL, length, 1L)
## M is our empty matrix, pre-filled with zeroes
M <- matrix(0, nrow = length(LL), ncol = max(Ncol))
## IJ is the row/column combination where values need to be inserted
IJ <- cbind(rep(seq_along(Ncol), times = Ncol), sequence(Ncol))
## Extract and insert!
M[IJ] <- unlist(LL, use.names = FALSE)
## View the result
M
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,] 1 2 3 0 0 0 0
# [2,] 0 0 0 0 0 0 0
# [3,] 1 2 3 1 0 0 0
# [4,] 1 2 3 4 5 6 7
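The steps above could be wrapped into a small helper. This is only a sketch: list_to_matrix and its fill argument are made-up names, and it uses lengths() (available in base R since 3.2.0) in place of the vapply() call. The guard is there because assigning an empty value through matrix indexing would fail when every list element is empty.

```r
# Hypothetical helper: pad a ragged list of numeric vectors into a matrix
list_to_matrix <- function(x, fill = 0) {
  n <- lengths(x)                      # number of values in each list element
  m <- matrix(fill, nrow = length(x), ncol = max(n, 0))
  if (sum(n) > 0) {
    # (row, column) pairs telling where each value goes
    ij <- cbind(rep(seq_along(n), times = n), sequence(n))
    m[ij] <- unlist(x, use.names = FALSE)
  }
  m
}

LL <- list(1:3, numeric(0), c(1:3, 1), 1:7)
M <- list_to_matrix(LL)
```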
I have a solution.
Not sure if it is good enough or there's any bug.
LL <- list(1:3, numeric(0), c(1:3, 1), 1:7)
with(data.frame(m <- plyr::rbind.fill.matrix(lapply(LL, matrix, nrow = 1))), replace(m, is.na(m), 0))

How to easily create dissimilarity matrix from vector of differences?

In my research each subject was given n*(n-1)/2 questions about his subjective opinion about dissimilarity between n=5 objects (for later use with 3-way multidimensional scaling).
I want to create a dissimilarity matrix from the 10-item vector v, arranged e.g. in the following fashion (for n=5):
1
2 5
3 6 8
4 7 9 10
This is sample code for achieving it for this particular n:
dissim<-rep(0,n*n)
dim(dissim)<-c(5,5)
dissim[2,1]<-v[1]
dissim[3,1]<-v[2]
dissim[4,1]<-v[3]
dissim[5,1]<-v[4]
dissim[3,2]<-v[5]
dissim[4,2]<-v[6]
dissim[5,2]<-v[7]
dissim[4,3]<-v[8]
dissim[5,3]<-v[9]
dissim[5,4]<-v[10]
Is there any utility function which helps doing it for any n? I know I can use two nested loops to do it, but the code would be clearer if I used a dedicated function.
And maybe I would learn about the existence of another useful library in the process?
n <- 5
mat <- matrix(0, ncol = n, nrow = n)
mat[lower.tri(mat)] <- 1:10
mat
[,1] [,2] [,3] [,4] [,5]
[1,] 0 0 0 0 0
[2,] 1 0 0 0 0
[3,] 2 5 0 0 0
[4,] 3 6 8 0 0
[5,] 4 7 9 10 0
Er... By chance I found the solution myself. It so happens that the internal structure of the dist object is just the vector v. So what works is this:
dissim<-v
class(dissim)='dist'
attr(dissim,"Size")<-5
dissim<-as.dist(dissim)
It works now, but I am not sure if this is a documented way and will always be valid.
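If relying on the internal layout of a dist object feels risky, the lower.tri() answer can be combined with as.dist(), which is the documented way to turn a square matrix into a dist object. A sketch for arbitrary n follows; vec_to_dist is a made-up name, and it assumes v is ordered column by column as in the question.

```r
# Hypothetical helper: build a dist object from a vector of n*(n-1)/2 dissimilarities
vec_to_dist <- function(v, n) {
  stopifnot(length(v) == n * (n - 1) / 2)
  m <- matrix(0, nrow = n, ncol = n)
  m[lower.tri(m)] <- v   # fills the lower triangle column by column
  as.dist(m)             # as.dist() reads the lower triangle of a square matrix
}

d <- vec_to_dist(1:10, 5)
full <- as.matrix(d)     # symmetric 5 x 5 matrix with a zero diagonal
```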

Resources