Related
I have two large sparse matrices (about 41,000 x 55,000 in size). The density of nonzero elements is around 10%. They both have the same row index and column index for nonzero elements.
I now want to modify the values in the first sparse matrix if values in the second matrix are below a certain threshold.
library(Matrix)
# Generating the example matrices.
set.seed(42)
# Rows with values.
i <- sample(1:41000, 227000000, replace = TRUE)
# Columns with values.
j <- sample(1:55000, 227000000, replace = TRUE)
# Values for the first matrix.
x1 <- runif(227000000)
# Values for the second matrix.
x2 <- sample(1:3, 227000000, replace = TRUE)
# Constructing the matrices.
m1 <- sparseMatrix(i = i, j = j, x = x1)
m2 <- sparseMatrix(i = i, j = j, x = x2)
I now get the rows, columns and values from the first matrix in a new matrix. This way, I can simply subset them and only the ones I am interested in remain.
# Getting the positions and values from the matrices.
position_matrix_from_m1 <- rbind(i = m1#i, j = summary(m1)$j, x = m1#x)
position_matrix_from_m2 <- rbind(i = m2#i, j = summary(m2)$j, x = m2#x)
# Subsetting to get the elements of interest.
position_matrix_from_m1 <- position_matrix_from_m1[,position_matrix_from_m1[3,] > 0 & position_matrix_from_m1[3,] < 0.05]
# We add 1 to the values, since the sparse matrix is 0-based.
position_matrix_from_m1[1,] <- position_matrix_from_m1[1,] + 1
position_matrix_from_m1[2,] <- position_matrix_from_m1[2,] + 1
Now I am getting into trouble. Overwriting the values in the second matrix takes too long. I let it run for several hours and it did not finish.
# This takes hours.
m2[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 1
m1[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 0
I thought about pasting the row and column information together. Then I have a unique identifier for each value. This also takes too long and is probably just very bad practice.
# We would get the unique identifiers after the subsetting.
m1_identifiers <- paste0(position_matrix_from_m1[1,], "_", position_matrix_from_m1[2,])
m2_identifiers <- paste0(position_matrix_from_m2[1,], "_", position_matrix_from_m2[2,])
# Now, I could use which and get the position of the values I want to change.
# This also uses to much memory.
m2_identifiers_of_interest <- which(m2_identifiers %in% m1_identifiers)
# Then I would modify the x values in the position_matrix_from_m2 matrix and overwrite m2#x in the sparse matrix object.
Is there a fundamental error in my approach? What should I do to run this efficiently?
Is there a fundamental error in my approach?
Yes. Here it is.
# This takes hours.
m2[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 1
m1[position_matrix_from_m1[1,], position_matrix_from_m1[2,]] <- 0
Syntax as mat[rn, cn] (whether mat is a dense or sparse matrix) is selecting all rows in rn and all columns in cn. So you get a length(rn) x length(cn) matrix. Here is a small example:
A <- matrix(1:9, 3, 3)
# [,1] [,2] [,3]
#[1,] 1 4 7
#[2,] 2 5 8
#[3,] 3 6 9
rn <- 1:2
cn <- 2:3
A[rn, cn]
# [,1] [,2]
#[1,] 4 7
#[2,] 5 8
What you intend to do is to select (rc[1], cn[1]), (rc[2], cn[2]) ..., only. The correct syntax is then mat[cbind(rn, cn)]. Here is a demo:
A[cbind(rn, cn)]
#[1] 4 8
So you need to fix your code to:
m2[cbind(position_matrix_from_m1[1,], position_matrix_from_m1[2,])] <- 1
m1[cbind(position_matrix_from_m1[1,], position_matrix_from_m1[2,])] <- 0
Oh wait... Based on your construction of position_matrix_from_m1, this is just
ij <- t(position_matrix_from_m1[1:2, ])
m2[ij] <- 1
m1[ij] <- 0
Now, let me explain how you can do better. You have underused summary(). It returns a 3-column data frame, giving (i, j, x) triplet, where both i and j are index starting from 1. You could have worked with this nice output directly, as follows:
# Getting (i, j, x) triplet (stored as a data.frame) for both `m1` and `m2`
position_matrix_from_m1 <- summary(m1)
# you never seem to use `position_matrix_from_m2` so I skip it
# Subsetting to get the elements of interest.
position_matrix_from_m1 <- subset(position_matrix_from_m1, x > 0 & x < 0.05)
Now you can do:
ij <- as.matrix(position_matrix_from_m1[, 1:2])
m2[ij] <- 1
m1[ij] <- 0
Is there a even better solution? Yes! Note that nonzero elements in m1 and m2 are located in the same positions. So basically, you just need to change m2#x according to m1#x.
ind <- m1#x > 0 & m1#x < 0.05
m2#x[ind] <- 1
m1#x[ind] <- 0
A complete R session
I don't have enough RAM to create your large matrix, so I reduced your problem size a little bit for testing. Everything worked smoothly.
library(Matrix)
# Generating the example matrices.
set.seed(42)
## reduce problem size to what my laptop can bear with
squeeze <- 0.1
# Rows with values.
i <- sample(1:(41000 * squeeze), 227000000 * squeeze ^ 2, replace = TRUE)
# Columns with values.
j <- sample(1:(55000 * squeeze), 227000000 * squeeze ^ 2, replace = TRUE)
# Values for the first matrix.
x1 <- runif(227000000 * squeeze ^ 2)
# Values for the second matrix.
x2 <- sample(1:3, 227000000 * squeeze ^ 2, replace = TRUE)
# Constructing the matrices.
m1 <- sparseMatrix(i = i, j = j, x = x1)
m2 <- sparseMatrix(i = i, j = j, x = x2)
## give me more usable RAM
rm(i, j, x1, x2)
##
## fix to your code
##
m1a <- m1
m2a <- m2
# Getting (i, j, x) triplet (stored as a data.frame) for both `m1` and `m2`
position_matrix_from_m1 <- summary(m1)
# Subsetting to get the elements of interest.
position_matrix_from_m1 <- subset(position_matrix_from_m1, x > 0 & x < 0.05)
ij <- as.matrix(position_matrix_from_m1[, 1:2])
m2a[ij] <- 1
m1a[ij] <- 0
##
## the best solution
##
m1b <- m1
m2b <- m2
ind <- m1#x > 0 & m1#x < 0.05
m2b#x[ind] <- 1
m1b#x[ind] <- 0
##
## they are identical
##
all.equal(m1a, m1b)
#[1] TRUE
all.equal(m2a, m2b)
#[1] TRUE
Caveat:
I know that some people may propose
m1c <- m1
m2c <- m2
logi <- m1 > 0 & m1 < 0.05
m2c[logi] <- 1
m1c[logi] <- 0
It looks completely natural in R's syntax. But trust me, it is extremely slow for large matrices.
I am trying to simulate an unlikely situation in a videogame using a Monte Carlo simulation. I'm extremely new at coding and thought this would be a fun situation to simulate.
There are 3 targets and they are being attacked 8 times independently. My problem comes with how to deal with the fact that one of the columns cannot be attacked more than 6 times, when there are 8 attacks.
I would like to take any attack aimed at column 2 select one of the other 2 columns at random to attack instead, but only if column 2 has been attacked 6 times already.
Here is my attempt to simulate with 5000 repeats, for example.
#determine number of repeats
trial <- 5000
#create matrix with a row for each trial
m <- matrix(0, nrow = trial, ncol = 3)
#The first for loop is for each row
#The second for loop runs each attack independently, sampling 1:3 at random, then adding one to that position of the row.
#The function that is called by ifelse() when m[trial, 2] > 6 = TRUE is the issue.
for (trial in 1:trial){
for (attack in 1:8) {
target <- sample(1:3, 1)
m[trial, target] <- m[trial, target] + 1
ifelse(m[trial, 2] > 6, #determines if the value of column 2 is greater than 6 after each attack
function(m){
m[trial, 2] <- m[trial, 2] - 1 #subtract the value from the second column to return it to 6
newtarget <- sample(c(1,3), 1) #select either column 1 or 3 as a new target at random
m[trial, newtarget] <- m[trial, newtarget] + 1 #add 1 to indicate the new target has been selected
m}, #return the matrix after modification
m) #do nothing if the value of the second column is <= 6
}
}
For example, if I have the matrix below:
> matrix(c(2,1,5,7,1,0), nrow = 2, ncol = 3)
[,1] [,2] [,3]
[1,] 2 5 1
[2,] 1 7 0
I would like the function to look at the 2nd line of the matrix, subtract 1 from 7, and then add 1 to either column 1 or 3 to create c(2,6,0) or c(1,6,1). I would like to learn how to do this within the loop, but it could be done afterwards as well.
I think I am making serious, fundamental error with how to use function(x) or ifelse.
Thank you.
Here's an improved version of your code:
set.seed(1)
trial <- 5000
#create matrix with a row for each trial
m <- matrix(0, nrow = trial, ncol = 3)
#The first for loop is for each row
#The second for loop runs each attack independently, sampling 1:3 at random, then adding one to that position of the row.
#The function that is called by ifelse() when m[trial, 2] > 6 = TRUE is the issue.
for (i in 1:trial){
for (attack in 1:8) {
target <- sample(1:3, 1)
m[i, target] <- m[i, target] + 1
#determines if the value of column 2 is greater than 6 after each attack
if(m[i, 2] > 6){
#subtract the value from the second column to return it to 6
m[i, 2] <- m[i, 2] - 1
#select either column 1 or 3 as a new target at random
newtarget <- sample(c(1,3), 1)
#add 1 to indicate the new target has been selected
m[i, newtarget] <- m[i, newtarget] + 1
}
}
}
# Notice the largest value in column 2 is no greater than 6.
apply(m, 2, max)
set.seed is used to make the results reproducible (usually just used for testing). The ifelse function has a different purpose than the normal if-else control flow. Here's an example:
x = runif(100)
ifelse(x < 0.5, 0, x)
You'll notice any element in x that is less than 0.5 is now zero. I changed your code to have an if block. Notice that m[i, 2] > 6 returns a single TRUE or FALSE whereas in the small example above, x < 0.5 a vector of logicals is returned. So ifelse can take a vector of logicals, but the if block requires there be only a single logical.
You were on the right track with using function, but it just isn't necessary in this case. Often, but not always, you'll define a function like this:
f = function(x)
x^2
But just returning the value doesn't mean what you want is changed:
x = 5
f(5) # 25
x # still 5
For more on this, look up function scope in R.
Lastly, I changed the loop to be i in 1:trial instead of trial in 1:trial. You probably wouldn't notice any issues in your case, but it is better practice to use a separate variable than that which makes up the range of the loop.
Hope this helps.
P.S. R isn't really known for it's speed when looping. If you want to make things goes faster, you'll typically need to vectorize your code.
I have a 25x25 matrix with numeric values and I want to choose through some conditions ! For example I want only the values from 0 to 0.2 to install them in another matrix how can I do this ?
x<-matrix(rnorm(25*25),25,25)
which(x>0.2) # indices where x>0.2
n<-40
h<-hist(x,breaks = seq(min(x),max(x),length.out = n+1),plot = F) # For multiple ranges and counts
h$breaks #n+1 break points
h$count #n counts of numbers between those breakpoints
What you want can be done with simple logical operations, see file R-intro.pdf that comes with your distribution of R, section 2.7 Index vectors; selecting and modifying subsets of a data set.
set.seed(1356) # make the results reproducible
m <- matrix(rnorm(25*25), 25) # input matrix
i <- 0 <= m & m <= 0.2 # logical index into 'm'
# create a result matrix with the same dimensions as the input
m2 <- matrix(NA, nrow = nrow(m), ncol = ncol(m))
m2[i] <- m[i] # assign the values you want
m2
sum(i) # count of values in [0, 0.2]
sum(m < 0) # count of values less than zero
sum(m > 0.2) # count of values greater than 0.2
I have two vectors in R and want to generate a new matrix based on them.
a=c(1,2,1,2,3) # a[1] is 1: thus row 1, column 1 should be equal to...
b=c(10,20,30,40,50) # ...b[1], or 10.
I want to produce matrix 'v' BUT without my 'for' loop through columns of v and my multiplication:
v = as.data.frame(matrix(0,nrow=length(a),ncol=length(unique(a))))
for(i in 1:ncol(v)) v[[i]][a==i] <- 1 # looping through columns of 'v'
v <- v*b
I am sure there is a fast/elegant way to do it in R. At least of expanding 'a' into the earlier version of 'v' (before its multiplication by 'b').
Thanks a lot!
This is one way that sparse matrices can be defined.
Matrix::sparseMatrix(i = seq_along(a), j = a, x = b)
# Setup the problem:
set.seed(4242)
a <- sample(1:100, 1000000, replace = TRUE)
b <- sample(1:500, length(a), replace = TRUE)
# Start the timer
start.time <- proc.time()[3]
# Actual code
# We use a matrix instead of a data.frame
# The number of columns matches the largest column index in vector "a"
v <- matrix(0,nrow=length(a), ncol= max(a))
v[cbind(seq_along(a), a)] <- b
# Show elapsed time
stop.time <- proc.time()[3]
cat("elapsed time is: ", stop.time - start.time, "seconds.\n")
# For a million rows and a hundred columns, my prehistoric
# ... laptop says: elapsed time is: 2.597 seconds.
# these checks take much longer to run than the function itself
# Make sure the modified column in each row matches vector "a"
stopifnot(TRUE == all.equal(a, apply(v!=0, 1, which)))
# Make sure the modified value in each row equals vector "b"
stopifnot(TRUE == all.equal(rowSums(v), b))
I am filling a 10x10 martix (mat) randomly until sum(mat) == 100
I wrote the following.... (i = 2 for another reason not specified here but i kept it at 2 to be consistent with my actual code)
mat <- matrix(rep(0, 100), nrow = 10)
mat[1,] <- c(0,0,0,0,0,0,0,0,0,1)
mat[2,] <- c(0,0,0,0,0,0,0,0,1,0)
mat[3,] <- c(0,0,0,0,0,0,0,1,0,0)
mat[4,] <- c(0,0,0,0,0,0,1,0,0,0)
mat[5,] <- c(0,0,0,0,0,1,0,0,0,0)
mat[6,] <- c(0,0,0,0,1,0,0,0,0,0)
mat[7,] <- c(0,0,0,1,0,0,0,0,0,0)
mat[8,] <- c(0,0,1,0,0,0,0,0,0,0)
mat[9,] <- c(0,1,0,0,0,0,0,0,0,0)
mat[10,] <- c(1,0,0,0,0,0,0,0,0,0)
i <- 2
set.seed(129)
while( sum(mat) < 100 ) {
# pick random cell
rnum <- sample( which(mat < 1), 1 )
mat[rnum] <- 1
##
print(paste0("i =", i))
print(paste0("rnum =", rnum))
print(sum(mat))
i = i + 1
}
For some reason when sum(mat) == 99 there are several steps extra...I would assume that once i = 91 the while would stop but it continues past this. Can somone explain what I have done wrong...
If I change the while condition to
while( sum(mat) < 100 & length(which(mat < 1)) > 0 )
the issue remains..
Your problem is equivalent to randomly ordering the indices of a matrix that are equal to 0. You can do this in one line with sample(which(mat < 1)). I suppose if you wanted to get exactly the same sort of output, you might try something like:
set.seed(144)
idx <- sample(which(mat < 1))
for (i in seq_along(idx)) {
print(paste0("i =", i))
print(paste0("rnum =", idx[i]))
print(sum(mat)+i)
}
# [1] "i =1"
# [1] "rnum =5"
# [1] 11
# [1] "i =2"
# [1] "rnum =70"
# [1] 12
# ...
See ?sample
Arguments:
x: Either a vector of one or more elements from which to choose,
or a positive integer. See ‘Details.’
...
If ‘x’ has length 1, is numeric (in the sense of ‘is.numeric’) and
‘x >= 1’, sampling _via_ ‘sample’ takes place from ‘1:x’. _Note_
that this convenience feature may lead to undesired behaviour when
‘x’ is of varying length in calls such as ‘sample(x)’. See the
examples.
In other words, if x in sample(x) is of length 1, sample returns a random number from 1:x. This happens towards the end of your loop, where there is just one 0 left in your matrix and one index is returned by which(mat < 1).
The iteration repeats on level 99 because sample() behaves very differently when the first parameter is a vector of length 1 and when it is greater than 1. When it is length 1, it assumes you a random number from 1 to that number. When it has length >1, then you get a random number from that vector.
Compare
sample(c(99,100),1)
and
sample(c(100),1)
Of course, this is an inefficient way of filling your matrix. As #josilber pointed out, a single call to sample could do everything you need.
The issue comes from how sample and which do the sampling when you have only a single '0' value left.
For example, do this:
mat <- matrix(rep(1, 100), nrow = 10)
Now you have a matrix of all 1's. Now lets make two numbers 0:
mat[15]<-0
mat[18]<-0
and then sample
sample(which(mat<1))
[1] 18 15
by adding a size=1 argument you get one or the other
now lets try this:
mat[18]<-1
sample(which(mat<1))
[1] 3 13 8 2 4 14 11 9 10 5 15 7 1 12 6
Oops, you did not get [1] 15 . Instead what happens in only a single integer (15 in this case) is passed tosample. When you do sample(x) and x is an integer, it gives you a sample from 1:x with the integers in random order.