I have a matrix of indices I where some of the indices are repeated. I put an example below.
I have another matrix A with dimensions compatible with the indices and initialized to 0 everywhere. I would like to do something like
A[I] += 1
I face two issues:
A[I] = A[I] + 1 is too inefficient
matrix I has repeated indices. For example, rows 2 and 6 are identical, and I would like to obtain A[1,2] = 2
A partial answer would be to create a three-column matrix, with the first two columns being the output of unique(I) and the third column holding the counts, but I don't see a solution for that either. Any pointer or help would be greatly appreciated!
I is:
[,1] [,2]
[1,] 1 1
[2,] 1 2
[3,] 1 3
[4,] 1 4
[5,] 1 1
[6,] 1 2
[7,] 1 3
This may be quickest using sparse matrix methods (see the Matrix package and others).
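For example, here is a minimal sketch of the sparse route with Matrix::sparseMatrix; as far as I know, duplicated (i, j) pairs are summed when the triplet form is compressed, which gives the counts directly:
library(Matrix)
I <- cbind(1, c(1:4, 1:3))                # example index matrix with repeated rows
A <- sparseMatrix(i = I[, 1], j = I[, 2],
                  x = 1, dims = c(2, 5))  # dims chosen just to fit this example
A[1, 2]                                   # 2: the repeated (1, 2) rows were summed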
Using standard matrices, you could collapse the identical rows with the xtabs function and then use matrix assignment (edited based on a comment):
I <- cbind(1, c(1:4,1:3))                    # index matrix with some repeated rows
tmp <- as.data.frame(xtabs( ~I[,1]+I[,2] ))  # count of each unique (row, column) pair
A <- matrix(0, nrow=5, ncol=5)
tmp2 <- as.matrix(tmp[,1:2])                 # indices as a character matrix (factor levels)
tmp3 <- as.numeric(tmp2)                     # back to numeric indices
dim(tmp3) <- dim(tmp2)
A[ tmp3 ] <- tmp[,3]                         # assign the counts via matrix indexing
A
You could probably make it a little quicker by pulling the core functionality out of as.data.frame.table rather than converting to data frame and back again.
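For instance, a rough sketch that skips the data.frame round trip by indexing the xtabs table directly (same I as above, with a fresh zero A):
tmp <- xtabs(~ I[, 1] + I[, 2])
A <- matrix(0, nrow = 5, ncol = 5)
nz <- which(tmp != 0, arr.ind = TRUE)             # positions of the non-zero counts
idx <- cbind(as.numeric(rownames(tmp))[nz[, 1]],  # map table positions back to the
             as.numeric(colnames(tmp))[nz[, 2]])  # original row/column indices of A
A[idx] <- tmp[nz]
A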
Here is another version that may be more efficient. It will overwrite some 0's with other 0's computed by xtabs:
I <- cbind(1:5,1:5)
A <- matrix(0, 5, 5)
tmp <- xtabs( ~I[,2]+I[,1] )
A[ as.numeric(rownames(tmp)), as.numeric(colnames(tmp)) ] <- c(tmp)
A
If the A matrix has dimnames and the I matrix contains the names instead of the indices, then this latter version will also work (just remove the as.numeric calls).
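A minimal sketch of that dimnames variant, with letter names invented purely for illustration:
A <- matrix(0, 5, 5, dimnames = list(letters[1:5], letters[1:5]))
I <- cbind(letters[1:5], letters[1:5])   # character "indices" matching the dimnames
tmp <- xtabs(~ I[, 2] + I[, 1])
A[rownames(tmp), colnames(tmp)] <- c(tmp)
A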
Here you go:
## Reproducible versions of your A and I objects
A <- matrix(0, nrow=2, ncol=5)
## For computations that follow, you'll be better off having this as a data.frame
## (Just use `I <- as.data.frame(I)` to convert a matrix object I).
I <- read.table(text=" 1 1
1 2
1 3
1 4
1 1
1 2
1 3", header=FALSE)
## Create data.frame with number of times each matrix element should
## be incremented
I$count <- ave(I[,1], I[,1], I[,2], FUN=length)
I <- unique(I)
## Replace desired elements, using a two-column matrix (the "third form of
## indexing" mentioned in the "Matrices and arrays" section of ?"[").
A[as.matrix(I[1:2])] <- I[[3]]
A
# [,1] [,2] [,3] [,4] [,5]
# [1,] 2 2 2 1 0
# [2,] 0 0 0 0 0
I have basic knowledge of R and am trying to automate some calculations on a data frame. I created a function and some code and would like some help aligning everything with the R philosophy.
I have a panel dataset df that you can build as follows:
# sample data frame
id <- c("i","i","i","j","j","j","k","k")
time <- c(1,2,3,1,2,3,1,2)
b1 <- c(1,0,1,0,0,1,1,0)
b2 <- c(0,0,1,0,0,0,1,1)
b3 <- c(0,1,0,1,0,0,0,0)
b4 <- c(0,0,0,0,1,0,1,1)
df <- data.frame(id,time,b1,b2,b3,b4)
I transform it using data.table as follows:
# data.table
### set-up
dt <- data.table(df)
setkey(dt,id,time)
### lead
nm1 <- grep("^b", colnames(dt), value=TRUE)
nm2 <- paste("lead", nm1, sep=".")
dt[, (nm2) := shift(.SD, type='lead'), by = id, .SDcols=nm1]
Now, for each group id, I want to compute a matrix counting all the transitions from one row to the next. Each matrix is stored in a list. I created a function that I apply to each group as follows:
# empty list
m.out <- list()
# number of "b" categories (needed for the matrix dimensions)
cat <- length(nm1)
# group i
m <- matrix(0, cat, cat + 1)
dt1 <- dt["i", c(nm1, nm2), with = FALSE]
m.out[[1]] <- calcMatrix(dt1)
# group j
m <- matrix(0, cat, cat + 1)
dt1 <- dt["j", c(nm1, nm2), with = FALSE]
m.out[[2]] <- calcMatrix(dt1)
# group k
m <- matrix(0, cat, cat + 1)
dt1 <- dt["k", c(nm1, nm2), with = FALSE]
m.out[[3]] <- calcMatrix(dt1)
How can I apply the function and create the list of matrices for all the groups of the data.table (especially if I run the code on a big dataset)?
I thought of this solution, but it does not work: the function does not create a matrix for each .SD, and the list is not built correctly:
m.out <- list()
m.out <- dt[,calcMatrix(.SD),by = id, .SDcols = c(nm1,nm2)]
The function calcMatrix is defined as such:
calcMatrix <- function(x) {
  # number of "b" categories
  cat <- length(nm1)
  # vector of column indices
  col.index <- grep("^b", colnames(x))
  # number of rows in the data.table x
  row.num <- nrow(x)
  # fill in matrix
  m <- matrix(0, cat, cat + 1)
  for (i in col.index) {
    for (j in 1:(row.num - 1)) {
      m[i, ] <- m[i, ] + as.integer(x[j, i, with = FALSE]) *
        c(0, as.matrix(x[j, .SD, .SDcols = nm2]))
    }
    m[i, 1] <- m[i, 1] + as.integer(x[row.num, i, with = FALSE])
  }
  return(m)
}
This function may not be well optimized for R because of the two loops. Is there a way to get rid of the loops?
Edit: I can explain what I do in calcMatrix.
For each group id, I want to obtain a matrix with the number of bi variables as rows and the number of bi variables + 1 as columns, counting the transitions within that group.
I take each bi and check which bj is reached at the next time (basically a transition from bi to bj), and I do +1 in the matrix at cell m[i, j+1] (the first column is reserved for the last row).
At the last row (last time) there is no transition, so if bi = 1 at that time, I do +1 in the first column (a transition onto itself).
For example, for group i, b1 = 1 at time 1 and b3 = 1 at time 2, so m[1, 4] gets +1; b1 = 1 again at the last time, so m[1, 1] gets +1.
This way I count all the transitions from bi to bj as well as all the last states, which is why I compute the lead with the shift function: I can add the lead row directly to the matrix. I was wondering whether this could be written without looping, through vectorization, as is the philosophy in R.
There are actually two questions here, and only one can be answered; the second, on optimizing the function, requires additional information.
How can I apply the function and create the list of matrix to all the groups of the data.table?
You may try lapply() to create a list of results:
lapply(dt[, unique(id)], function(.id) {calcMatrix(dt[id == .id, c(nm1,nm2), with=FALSE])})
which returns:
[[1]]
[,1] [,2] [,3] [,4] [,5]
[1,] 1 0 0 1 0
[2,] 1 0 0 0 0
[3,] 0 1 1 0 0
[4,] 0 0 0 0 0
[[2]]
[,1] [,2] [,3] [,4] [,5]
[1,] 1 0 0 0 0
[2,] 0 0 0 0 0
[3,] 0 0 0 0 1
[4,] 0 1 0 0 0
[[3]]
[,1] [,2] [,3] [,4] [,5]
[1,] 0 0 1 0 1
[2,] 1 0 1 0 1
[3,] 0 0 0 0 0
[4,] 1 0 1 0 1
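As a hedged follow-up, if you would rather keep the results inside the data.table itself: wrapping the matrix in list() in j creates a list column with one matrix per group, which also sidesteps the problem in the attempt above (res and m.out are just illustrative names):
res <- dt[, .(m = list(calcMatrix(.SD))), by = id, .SDcols = c(nm1, nm2)]
m.out <- setNames(res$m, res$id)   # plain named list of matrices, one per group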
I would like to replace the loops in the following code.
Test <- function(j) {
  card <- 5
  # matrix s is to hold the results
  s <- matrix(rep(0, j * card), nrow = j, ncol = card, byrow = TRUE)
  # Loop1
  for (k in 1:j) {
    # A vector should be drawn from another matrix;
    # for simplicity, I define a vector "sol" to be modified in Loop2
    sol <- rep(1, card)
    # Given the vector "sol", select a vector position randomly
    # for a given number of times (i.e. steps), say 10
    step <- 10
    # Loop2 - modify values in sol
    for (i in seq_len(step)) {
      # Draw a position
      r <- sample(seq_len(card), 1)
      # Each position has specific probabilities for the assignment of
      # possible values, meaning p is related to the position.
      # For simplicity, just define the probabilities at random here.
      p <- runif(3, 0, 1)   # just create p for each step
      p <- p / sum(p)
      # Finally, draw a value for the selected position;
      # the value of sol keeps changing within this loop.
      sol[r] <- sample(1:3, 1, prob = p)
    }
    # keep the result in matrix s
    s[k, ] <- sol
  }
  return(s)
}
Given an input vector
[,1] [,2] [,3] [,4] [,5]
[1,] 1 1 1 1 1
It is expected to output a matrix like this:
[,1] [,2] [,3] [,4] [,5]
[1,] 1 1 3 2 3
[2,] 1 1 1 1 3
[3,] 2 2 2 2 3
[4,] 2 1 2 2 1
[5,] 1 1 3 1 1
Each step in Loop2 depends on a probability vector, which is then used to change a value in sol. I tried to replace Loop2 with sapply as follows:
sapply(seq_len(step), function(x) {
  r <- sample(seq_len(card), 1)
  p <- runif(3, 0, 1)
  p <- p / sum(p)
  sol[r] <- sample(1:3, 1, prob = p)   # try to modify a value in sol
})
s[k, ] <- sol   # actually, no change in sol
However, the values in sol do not change; it stays all 1s, i.e. 1, 1, 1, 1, 1.
How can Loop2 be replaced by an apply-family function or something similar?
Thank you.
If I understand correctly what you're trying to achieve, you don't need apply() functions for this:
Test <- function(j) {
  card <- 5
  p <- runif(3, 0, 1)
  p <- p / sum(p)
  out <- matrix(sample(1:3, j * card, replace = TRUE, prob = p), ncol = card, nrow = j)
  return(out)
}
Test(5)
[,1] [,2] [,3] [,4] [,5]
[1,] 2 2 2 1 1
[2,] 1 2 3 2 2
[3,] 2 3 1 1 2
[4,] 1 2 1 2 1
[5,] 2 1 1 2 2
In order to refactor this function, notice that the r <- sample(seq_len(card), 1) draws are independent, so the number of times each position is selected over the step draws follows a multinomial distribution. This can be pulled out of the loop.
The second thing to note is the conditional distribution of s[i,j] given those counts: it is 1 if the multinomial count for that cell is zero, and otherwise it is a draw from sample(3, 1, prob=runif(3)). (The distribution does not change if a cell is selected repeatedly.)
Put those two facts together, and we have this:
Test2 <- function(j, card = 5, step = 10) {
  r <- t(rmultinom(j, step, rep(1, card)))
  s <- apply(r, 1:2, function(x) if (x > 0) sample(3, 1, prob = runif(3)) else 1)
  return(s)
}
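For example (just a quick sanity check; the exact values are random):
set.seed(1)
Test2(5)   # a 5 x 5 matrix with entries in 1..3; cells that were never selected stay 1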
What about this:
test2 <- function(j) {
  card <- 5
  # Create a matrix where each of the j*card rows is a p as defined in the original function
  p <- matrix(runif(3 * j * card), ncol = 3)
  p <- t(apply(p, 1, function(x) x / sum(x)))
  # For each row of p, draw a single value at random
  draws <- apply(p, 1, function(x) sample(1:3, 1, prob = x))
  # Format the output as a j*card matrix
  out <- matrix(draws, ncol = card, byrow = TRUE)
  return(out)
}
If test2() does what you want, it's roughly 300 times faster than Test() on my machine.
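If you want to check a claim like that on your own machine, a rough timing sketch with base R's system.time (exact ratios will vary):
system.time(for (i in 1:200) Test(50))    # original double loop
system.time(for (i in 1:200) test2(50))   # vectorized version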
I am trying to split my matrix into a list by the unique values in a vector. The vector has as many values as each column of the matrix (i.e. one value per row).
Here is an example:
#matrix
b <- cbind(c(2,2,1,0), c(2,2,1,5), c(2,2,5,6))
#vector
a <- c(5,5,4,1)
#??
#my outcome should looks like
v <- list(cbind(c(2,2), c(2,2), c(2,2)), c(1,1,5), c(0,5,6))
So basically, I want to split my matrix into multiple matrices, by rows, according to the unique values in a vector. More specifically, my vector is sorted from the highest value to the lowest, and I need to keep that order in the list! As you can see in the example, v[[1]] is the matrix for unique(a)[1], and so on.
lapply(split(seq_along(a), a), #split indices by a
function(m, ind) m[ind,], m = b)[order(unique(a))]
#$`5`
# [,1] [,2] [,3]
#[1,] 2 2 2
#[2,] 2 2 2
#
#$`4`
#[1] 1 1 5
#
#$`1`
#[1] 0 5 6
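The same result written out step by step, in case the one-liner is hard to read (a sketch, not a faster alternative):
row.groups <- split(seq_along(a), a)              # row indices grouped by the values of a
v <- lapply(row.groups, function(ind) b[ind, ])   # subset the matrix rows for each group
v <- v[order(unique(a))]                          # reorder from the highest value of a down
v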
I would like to replace, as fast as possible, the first zero in some rows of a matrix with values stored in another vector.
There is a numeric matrix where each row is a vector with some zeros.
I also have two vectors: one containing the rows in which to replace (replace.in.these.rows) and another with the new values (new.values). I can generate the vector of the positions of the first zeros with sapply:
mat <- matrix(1,5,5)
mat[c(1,8,10,14,16,22,14)] <- 0
replace.in.these.rows <- c(1,2,3)
new.values <- c(91,92,93)
corresponding.poz.of.1st.zero <- sapply(replace.in.these.rows,
function(x) which(mat [x,] == 0)[1] )
Now I would like something that iterates over the index vectors, ideally without a for loop, along the lines of:
mat[replace.in.these.rows, corresponding.poz.of.1st.zero] <- new.values
Is there a trick for indexing with something more than simple vectors? I could not use a list or an array (e.g. column by column) as an index.
By default, R matrices are stored as a set of column vectors. Do I gain anything if I store the data in transposed form? It would mean working on columns instead of rows.
Context:
This matrix stores the contact IDs of a network. It is not an n x n adjacency matrix, but rather an n x max.number.of.partners matrix (with max.number.of.partners around 30).
The network uses an edge list by default, but I wanted to store all the links from a given node together.
I assumed, but am not sure, that this is more efficient than extracting the information from the edge list each time (multiple times each round in a simulation).
I also assumed that this linearly growing matrix form is faster than storing the same information in a similarly formatted list.
Some comments on these contextual assumptions are also welcome.
Edit: If only the first zeros are to be replaced, then this approach works:
first0s <-apply(mat[replace.in.these.rows, ] , 1, function(x) which(x==0)[1])
mat[cbind(replace.in.these.rows, first0s)] <- new.values
> mat
[,1] [,2] [,3] [,4] [,5]
[1,] 91 1 1 0 1
[2,] 1 1 1 1 92
[3,] 1 93 1 1 1
[4,] 1 1 0 1 1
[5,] 1 0 1 1 1
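Equivalently, the vector already computed with sapply in the question can be plugged into the same two-column-matrix assignment:
mat[cbind(replace.in.these.rows, corresponding.poz.of.1st.zero)] <- new.values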
Edit: I originally thought the goal was to replace all zeros in the chosen rows, and that is what this addresses. A completely vectorized approach:
idxs <- which(mat==0, arr.ind=TRUE)
# This returns the rows and columns that identify the zero elements
# idxs[,"row"] %in% replace.in.these.rows
# [1] TRUE TRUE FALSE FALSE TRUE TRUE
# That isolates the ones you want.
# idxs[ idxs[,"row"] %in% replace.in.these.rows , ]
# that shows what you will supply as the two column argument to "["
# row col
#[1,] 1 1
#[2,] 3 2
#[3,] 1 4
#[4,] 2 5
chosen.ones <- idxs[ idxs[,"row"] %in% replace.in.these.rows , ]
mat[chosen.ones] <- new.values[chosen.ones[,"row"]]
# Replace the zeros with the values chosen (and duplicated if necessary) by "row".
mat
#---------
[,1] [,2] [,3] [,4] [,5]
[1,] 91 1 1 91 1
[2,] 1 1 1 1 92
[3,] 1 93 1 1 1
[4,] 1 1 0 1 1
[5,] 1 0 1 1 1
I have a list of lists resulting from a bigsplit() operation (from package biganalytics, part of the bigmemory packages).
Each list represents a column of a matrix, and each list item is the row index of a 1 in a binary matrix.
What is the best way to turn this list of lists into a sparse binary (0/1) matrix?
Is using lapply() within lapply() the only solution? How do I keep the factor levels that name the lists as the column names?
You can do this without an lapply whatsoever if you need a matrix.
Say you have a list constructed like this:
Test <- list(
col1=list(2,4,7),
col2=list(3,2,6,8),
col3=list(1,4,5,3,7)
)
First you construct a matrix of zeros with the correct dimensions. If you know them beforehand, that's easy. Otherwise you can derive them easily:
n.cols <- length(Test)
n.ids <- sapply(Test,length)
n.rows <- max(unlist(Test))
out <- matrix(0,nrow=n.rows,ncol=n.cols)
Then you use the fact that matrices are filled column-wise to calculate the index of each cell that has to become one:
id <- unlist(Test)+rep(0:(n.cols-1),n.ids)*n.rows
out[id] <- 1
colnames(out) <- names(Test)
This gives:
> out
col1 col2 col3
[1,] 0 0 1
[2,] 1 1 0
[3,] 0 1 1
[4,] 1 0 1
[5,] 0 0 1
[6,] 0 1 0
[7,] 1 0 1
[8,] 0 1 0
You might also consider using the Matrix package, which deals with large sparse matrices more efficiently than base R. You can build a sparse matrix of 0s and 1s by describing which rows and columns should be 1s.
library(Matrix)
Test <- list(
col1=list(2,4,7),
col2=list(3,2,6,8),
col3=list(1,4,5,3,7)
)
n.ids <- sapply(Test,length)
vals <- unlist(Test)
out <- sparseMatrix(vals, rep(seq_along(n.ids), n.ids))
The result is
> out
8 x 3 sparse Matrix of class "ngCMatrix"
[1,] . . |
[2,] | | .
[3,] . | |
[4,] | . |
[5,] . . |
[6,] . | .
[7,] | . |
[8,] . | .
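A hedged follow-up on the column-name part of the question: sparse matrices accept dimnames too, and multiplying the pattern matrix by 1 should (I believe) give a numeric 0/1 sparse matrix rather than the "ngCMatrix" pattern class:
colnames(out) <- names(Test)   # keep col1/col2/col3 as column names
out01 <- out * 1               # numeric 0/1 sparse matrix instead of the pattern class
out01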
Using Joris' example, here's a syntactically simple way using sapply/replace. I suspect Joris' approach is faster, because it fills in a pre-allocated matrix, whereas my approach implicitly involves cbinding a bunch of columns, and so would require repeated memory allocations for the columns (is that true?).
Test <- list(
col1=list(2,4,7),
col2=list(3,2,6,8),
col3=list(1,4,5,3,7)
)
> z <- rep(0, max(unlist(Test)))
> sapply( Test, function(x) replace(z,unlist(x),1))
col1 col2 col3
[1,] 0 0 1
[2,] 1 1 0
[3,] 0 1 1
[4,] 1 0 1
[5,] 0 0 1
[6,] 0 1 0
[7,] 1 0 1
[8,] 0 1 0
Here is some sample data that seems to fit your description.
a <- as.list(sample(20, 5))
b <- as.list(sample(20, 5))
c <- as.list(sample(20, 5))
abc <- list(a = a, b = b, c = c)
I do not see a way to do this with nested lapply() but here is another way. It would be nice to eliminate the unlist(), but maybe someone else can improve on this.
sp_to_bin <- function(splist) {
  binlist <- numeric(100)
  binlist[unlist(splist)] <- 1
  return(binlist)
}
bindf <- data.frame(lapply(abc, sp_to_bin))
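A small hedged variant that derives the vector length from the data instead of hardcoding 100 (sp_to_bin2 is just an illustrative name):
sp_to_bin2 <- function(splist, n) {
  binlist <- numeric(n)           # one slot per possible row index
  binlist[unlist(splist)] <- 1    # set the listed positions to 1
  binlist
}
n.rows <- max(unlist(abc))
bindf2 <- data.frame(lapply(abc, sp_to_bin2, n = n.rows))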
To build on Joris's answer, which used a scalar index vector to fill in the output matrix, you can also use a two-column index matrix; this can sometimes be a little clearer to write or to understand later.
Test <- list(
col1=list(2,4,7),
col2=list(3,2,6,8),
col3=list(1,4,5,3,7)
)
n.cols <- length(Test)
n.ids <- sapply(Test,length)
vals <- unlist(Test)
n.rows <- max(vals)
idx <- cbind(vals, rep(seq_along(n.ids), n.ids))
out <- matrix(0,nrow=n.rows,ncol=n.cols)
out[idx] <- 1
colnames(out) <- names(Test)
The result is the same.