I am running a simulation in R, which I am trying to make more efficient.
A little bit of background: this is an abstract simulation to test the effects of mutation on a population. The population has N individuals, and each individual has a genotype of M letters. Each letter can be one of the twenty amino acids (which I denote as 0:19).
One of the most (computationally) expensive tasks involves taking a matrix "mat" with M rows and N columns, which initially starts as a matrix of all zeroes,
mat <- matrix(rep(0,M*N),nrow=M)
And then changing (mutating) at least one letter in the genotype of each individual. I say "at least" because I would ideally like to set a mutation rate (mutrate) such that, if I set it to 2 in my overall simulation function, it causes 2 mutations per individual in the matrix.
I found two rather computationally expensive ways to do so. As you can see below, only the second method incorporates the mutation rate parameter mutrate (I could not easily think of how to incorporate it into the first).
#method 1
for(i in 1:N){
position <- floor(runif(N, min=0, max=M))
letter <- floor(runif(N, min=0, max=19))
mat[position[i],i] = letter[i]}
#method 2, somewhat faster and incorporates mutation rate
mat <- apply(mat,2,function(x) (x+sample(c(rep(0,M-mutrate),sample(0:19,size=mutrate))))%%20)
The second method incorporates a modulus because genotype values have to be between 0 and 19 as I mentioned.
A few additional notes for clarity:
I don't strictly need every individual to get exactly the same number of mutations. That being said, the distribution should be narrow enough that, if mutrate = 2, most individuals get two mutations, some one, some maybe three. However, I don't want one individual getting a huge number of mutations while many individuals get none. Notably, some mutations will change a letter into the same letter, so for a large population size N, the expected average number of mutations is slightly less than the assigned mutrate.
I believe the answer has something to do with the ability to use the square-bracket subsetting method to obtain one random element from every column of the matrix mat. However, I could not find any information about how to use the syntax to isolate one random element from every column of a matrix. mat[sample(1:M),sample(1:N)] obviously gives you the whole matrix... perhaps I am missing something stupidly clear here.
Any help is greatly appreciated !
To answer your last question first: you can access a single cell in a matrix with mat[row,column], or multiple scattered cells by their sequential cell id. Cell ids run down the columns: cell 1,1 is the first cell, followed by 2,1, 3,1, etc:
mat <- matrix(rep(0, 5*5), nrow=5)
mat[c(1,3,5,7,9)] = c(1,2,3,4,5)
mat
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    0    0    0    0
[2,]    0    4    0    0    0
[3,]    2    0    0    0    0
[4,]    0    5    0    0    0
[5,]    3    0    0    0    0
Accessing or overwriting individual cells is fast as well. The fastest way I could think of to perform your task is to first create vectors for the values we want: a vector of all column indices (every column repeated mutrate times), a vector of random row indices, and a vector of random new values for these column/row combinations.
cols = rep(seq_len(N), mutrate)
rows = sample(M, N*mutrate, replace = T)
values = sample(genotypes, N*mutrate, replace = T) - 1 # -1 offset since genotypes are 0-indexed
for(i in seq_len(N*mutrate)) {
mat[rows[i],cols[i]] = values[i]
}
Instead of that for-loop to update the matrix, we can also calculate the cell-IDs so we can update all matrix cells in one go:
cols = rep(seq_len(N), mutrate)
rows = sample(M, N*mutrate, replace = T)
cellid = rows + (cols-1)*M
values = sample(genotypes, N*mutrate, replace = T) - 1 # -1 offset since genotypes are 0-indexed
mat[cellid] = values
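As a side note, base R also accepts a two-column matrix of (row, column) pairs as an index, which avoids computing the cell IDs by hand. The following is a minimal sketch on a small matrix; the names mat_a and mat_b and the small sizes are only for illustration, and it assumes the same cols/rows/values construction as above.

```r
set.seed(42)
M <- 6          # genotype length (small, for illustration only)
N <- 4          # individuals
mutrate <- 2
genotypes <- 20

cols   <- rep(seq_len(N), mutrate)
rows   <- sample(M, N * mutrate, replace = TRUE)
values <- sample(genotypes, N * mutrate, replace = TRUE) - 1  # letters 0..19

# Variant A: explicit cell IDs (column-major position of each cell)
mat_a <- matrix(0, nrow = M, ncol = N)
mat_a[rows + (cols - 1) * M] <- values

# Variant B: index with a two-column matrix of (row, column) pairs
mat_b <- matrix(0, nrow = M, ncol = N)
mat_b[cbind(rows, cols)] <- values

identical(mat_a, mat_b)
```

Both variants update all cells in one vectorized assignment; the cbind() form is mostly a readability choice.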
Benchmarking the methods on a 10000x6000 matrix (M = 10000 rows, N = 6000 columns) shows how fast each one is:
N = 6000 # individuals
M = 10000 # genotype length
genotypes = 20
mutrate = 2
method1 <- function() {
mat <- matrix(rep(0,M*N),nrow=M)
for(i in 1:(N*mutrate)){
position <- sample(M, 1)
letter <- sample(genotypes, 1) - 1
mat[position,(i-1)%%N+1] = letter # map i to columns 1..N (i%%N would give 0 when i is a multiple of N)
}
return(mat)
}
method2 <- function() {
mat <- matrix(rep(0,M*N),nrow=M)
mat <- apply(mat,2,function(x) (x+sample(c(rep(0,M-mutrate),sample(0:19,size=mutrate))%%20)))
}
method3 <- function() {
mat <- matrix(rep(0,M*N),nrow=M)
cols = rep(seq_len(N), mutrate)
rows = sample(M, N*mutrate, replace = T)
values = sample(genotypes, N*mutrate, replace = T) - 1 # -1 offset since genotypes are 0-indexed
for(i in seq_len(N*mutrate)) {
mat[rows[i],cols[i]] = values[i]
}
return(mat)
}
method4 <- function() {
mat <- matrix(rep(0,M*N),nrow=M)
cols = rep(seq_len(N), mutrate)
rows = sample(M, N*mutrate, replace = T)
cellid = rows + (cols-1)*M
values = sample(genotypes, N*mutrate, replace = T) - 1 # -1 offset since genotypes are 0-indexed
mat[cellid] = values
return(mat)
}
benchmark <- function(func, times=10) {
begin <- as.numeric(Sys.time())
for(i in seq_len(times))
retval <- eval(parse(text=func))
end <- as.numeric(Sys.time())
cat(func, 'took', (end-begin)/times, 'seconds\n')
return(retval)
}
ret1 <- benchmark('method1()')
ret2 <- benchmark('method2()')
ret3 <- benchmark('method3()')
ret4 <- benchmark('method4()')
I've modified your first method to speed it up and to make it honor mutrate. The timings:
method1() took 0.8936087 seconds
method2() took 8.767686 seconds
method3() took 0.7008878 seconds
method4() took 0.6548331 seconds
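One caveat: because the row indices are sampled with replacement, an individual can occasionally be hit twice at the same position. If every individual should get exactly mutrate distinct positions, a sketch along the following lines may work. It assumes mutrate <= M, and the replicate() call loops over individuals, so it is likely somewhat slower than the fully vectorized version; the small sizes are only for illustration.

```r
set.seed(7)
N <- 5          # individuals (small, for illustration only)
M <- 10         # genotype length
mutrate <- 2
genotypes <- 20

mat <- matrix(0, nrow = M, ncol = N)

# For each individual, draw mutrate distinct positions (no replacement).
# as.vector() flattens the result individual by individual.
rows <- as.vector(replicate(N, sample.int(M, mutrate)))
cols <- rep(seq_len(N), each = mutrate)   # matches the flattening order
values <- sample(genotypes, N * mutrate, replace = TRUE) - 1

mat[cbind(rows, cols)] <- values
```

A new value can still equal the old letter, so the number of visibly changed cells per individual can be lower than mutrate.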
Related
I want to create a function which replaces a chosen row of a matrix with zeros. I try to think of the matrix as arbitrary, but for this example I have done it with a sample 3x3 matrix with the numbers 1-9, called a_matrix
1 4 7
2 5 8
3 6 9
I have done:
zero_row <- function(M, n){
n <- c(0,0,0)
M*n
}
And then I have set the matrix and tried to get my desired result by using my zero_row function
mat1 <- a_matrix
zero_row(M = mat1, n = 1)
zero_row(M = mat1, n = 2)
zero_row(M = mat1, n = 3)
However, right now all I get is a matrix with only zeros, which I do understand why. But if I instead change the vector n to one of the following
n <- c(0,1,1)
n <- c(1,0,1)
n <- c(1,1,0)
I get my desired result for n=1, n=2, n=3 separately. But what I want is a single function that sets the row given by n to zero, so that I don't have to change the vector by hand for every separate n. For n=2, for example, I should get
1 4 7
0 0 0
3 6 9
And is it better to do it in another form, instead of using vectors?
Here is a way.
zero_row <- function(M, n){
stopifnot(n <= nrow(M))
M[n, ] <- 0
M
}
A <- matrix(1:9, nrow = 3)
zero_row(A, 1)
zero_row(A, 2)
zero_row(A, 3)
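For completeness, the multiplication idea from the question can also be made to work, because R recycles a vector of length nrow(M) down each column of the matrix. This is only a sketch; zero_row_mult and keep are hypothetical names, and direct assignment as above is usually the simpler choice.

```r
zero_row_mult <- function(M, n) {
  stopifnot(n <= nrow(M))
  keep <- rep(1, nrow(M))  # 1 = keep the row, 0 = zero it out
  keep[n] <- 0
  M * keep                 # keep is recycled down each column
}

A <- matrix(1:9, nrow = 3)
zero_row_mult(A, 2)
```

Note that the multiplication returns a double matrix even when the input is an integer matrix.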
Good afternoon,
I have developed this R function that hashes data into buckets:
# The used packages
library("pacman")
pacman::p_load(dplyr, tidyr, devtools, MASS, pracma, mvtnorm, interval, intervals)
pacman::p_load(sprof, RDocumentation, helpRFunctions, foreach , philentropy , Rcpp , RcppAlgos)
hash<-function(v,p){
if(dot(v,p)>0) return(1) else (0) }
LSH_Band<-function(data,K ){
# We retrieve numerical columns of data
t<-list.df.var.types(data)
df.r<-as.matrix(data[c(t$numeric,t$Intervals)])
n=nrow(df.r)
# we create a K*K matrix drawn from the standard normal distribution
rn=array(rnorm(K*K,0,1),c(K,K))
# we create a K*K matrix of integers drawn from the uniform distribution; integers are unique within each column
rd=unique.array(array(unique(ceiling(runif(K*K,0,ncol(df.r)))),c(K,K)))
buckets<-array(NA,c(K,n))
for (i in 1:K) {
for (j in 1:n) {
buckets[i,j]<-hash(df.r[j,][rd[,i]],rn[,i])
}
}
return(buckets)
}
> df.r
  age height salaire.1 salaire.2
1  27    180         0      5000
2  26    178         0      5000
3  30    190      7000     10000
4  31    185      7000     10000
5  31    187      7000     10000
6  38    160     10000     15000
7  39    158     10000     15000
> LSH_Band(df.r, 3 )
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]    1    1    1    1    1    1    1
[2,]    1    1    0    0    0    0    0
[3,]    0    0    0    0    0    0    0
The dot function is the scalar product of two vectors.
My LSH function takes a row of my data, then it takes a part of that row using df.r[j,][rd[,i]]. df.r[j,] is the j-th row of the data.
rd[,i] : rd is a K*K matrix of integers between 1 and ncol(df.r) , each column of the matrix contains only unique integers.
rn[,i] : rn is a K*K matrix that contains values of N(0,1) law.
In the resulting table, observations are represented in columns, and there are K rows. For the last row, I compute the scalar product between df.r[j,][rd[,K]] and rn[,K], and obtain 1 if the scalar product is positive (0 otherwise). rd[,K] and rn[,K] are used only for the last row of the resulting table, for all observations in that row.
My question :
Is it possible to replace the loops over i and j with an lapply call?
My real data will be large, which is why I'm asking this question.
Thank you !
The following is a bit too long as a comment, so here are some pointers/issues/remarks:
First off, I have to say I struggle to understand what LSH_Band does. Perhaps some context would help here.
I don't understand the purpose of certain functions like helpRFunctions::list.df.var.types, which simply seems to return the column names of data in a list, grouped by type. Note also that t$Intervals returns NULL based on the sample data you give. So I'm not sure what's going on there.
I don't see the point of function pracma::dot either. The dot product between two vectors can be calculated in base R using %*%. There's really no need for an additional package.
Function hash can be written more compactly as
hash <- function(v, p) +(as.numeric(v %*% p) > 0)
This avoids the if conditional which is slow.
Notwithstanding my lack of understanding what it is you're trying to do, here are some tweaks to your code
hash <- function(v, p) +(as.numeric(v %*% p) > 0)
LSH_Band <- function(data, K, seed = NULL) {
# We retrieve numerical columns of data
data <- as.matrix(data[sapply(data, is.numeric)])
# we create a K*K matrix drawn from the standard normal distribution
if (!is.null(seed)) set.seed(seed)
rn <- matrix(rnorm(K * K, 0, 1), nrow = K, ncol = K)
# we create a K*K matrix of column indices; indices are unique within each column (requires K <= ncol(data))
rd <- sapply(seq_len(K), function(col) sample.int(ncol(data), K))
buckets <- matrix(NA, nrow = K, ncol = nrow(data))
for (i in 1:K) {
buckets[i, ] <- apply(data, 1, function(row) hash(row[rd[, i]], rn[, i]))
}
buckets
}
Always add an option to use a reproducible seed when working with random numbers. That will make debugging a lot easier.
You can replace at least one for loop with apply (which when using MARGIN = 1 iterates through the rows of a matrix (or array)).
I've removed all the unnecessary package dependencies, and replaced the functionality with base R functions.
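If the remaining apply call becomes a bottleneck on large data, the inner loop over rows can probably be replaced by a single matrix product per band. The following is only a sketch under that assumption: lsh_buckets is a hypothetical helper that takes the random matrices rd and rn as arguments (so they can be generated once and reused), and the sample data is rebuilt from the question.

```r
# Hypothetical helper: one matrix product per band instead of a loop over rows.
lsh_buckets <- function(data, rd, rn) {
  K <- ncol(rd)
  buckets <- matrix(NA_integer_, nrow = K, ncol = nrow(data))
  for (i in seq_len(K)) {
    # Project all observations at once onto the i-th random vector
    proj <- data[, rd[, i], drop = FALSE] %*% rn[, i]
    buckets[i, ] <- as.integer(proj > 0)
  }
  buckets
}

# Sample data from the question, as a numeric matrix
data <- cbind(age       = c(27, 26, 30, 31, 31, 38, 39),
              height    = c(180, 178, 190, 185, 187, 160, 158),
              salaire.1 = c(0, 0, 7000, 7000, 7000, 10000, 10000),
              salaire.2 = c(5000, 5000, 10000, 10000, 10000, 15000, 15000))

set.seed(1)
K  <- 3
rn <- matrix(rnorm(K * K), nrow = K, ncol = K)
rd <- sapply(seq_len(K), function(col) sample.int(ncol(data), K))

buckets <- lsh_buckets(data, rd, rn)
buckets
```

The loop over the K bands remains, but K is typically much smaller than the number of observations, so most of the work moves into the matrix product.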
I am going to simulate a vector with 100 elements in R. The vector only contains numeric values 0, 1 or 2. I only know the sum of the vector. For example, if the sum of the vector is 30, the total numbers of 0 can be 77, the total numbers of 1 can be 16, the total numbers of 2 can be 7. How can I simulate such a vector in R based on the sum of the vector?
Here is one pretty simple attempt to solve this problem. Instead of sampling all 100 elements, it makes use of the fact that there must be at least 100 - target zeros. I think there might also be a way to use the fact there can be at most 100 - (target / 2) zeros (if all the nonzero elements are 2).
sim_freq = function(target, total_size = 100, max_attempts = 100) {
min_zeros = total_size - target
target_found = FALSE
attempts = 0
while (! target_found) {
alleles = sample(0:2, size = target, replace = TRUE)
target_found = sum(alleles) == target
attempts = attempts + 1
if (attempts > max_attempts) {
stop("Couldn't find a match")
}
}
print(paste0("Found a match in ", attempts, " attempts."))
# Shuffle the generated alleles and zeros together
sample(c(alleles, rep(0, min_zeros)))
}
Usage:
sim_freq(26)
sim_freq(77)
In my test runs with targets of 26 and 77, it generally finds a vector that has the desired sum in < 20 attempts, but that might vary a lot for different targets.
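If the retry loop is a concern, the vector can also be built directly, without rejection. The sketch below first picks how many 2s to use from the feasible range, then fills in the 1s and 0s, and finally shuffles. Note the assumption: this makes every feasible count of 2s equally likely, which is not the same distribution as the rejection sampler above. sim_sum_direct is a hypothetical name.

```r
sim_sum_direct <- function(target, total_size = 100) {
  stopifnot(target >= 0, target <= 2 * total_size)
  # Feasible number of 2s: enough that the vector is not too long,
  # and no more than target allows
  k_min <- max(0, target - total_size)
  k_max <- target %/% 2
  k_range <- k_min:k_max
  n_twos  <- k_range[sample.int(length(k_range), 1)]
  n_ones  <- target - 2 * n_twos
  n_zeros <- total_size - n_ones - n_twos
  x <- c(rep(2, n_twos), rep(1, n_ones), rep(0, n_zeros))
  x[sample.int(length(x))]  # shuffle
}

set.seed(123)
v <- sim_sum_direct(30)
sum(v)
table(v)
```

Indexing with sample.int() is used for the draw and the shuffle because sample(x) treats a single number x as the range 1:x.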
Here you have some code to do it, I did it for 15 elements to calculate it faster:
x <- 0:2 #values you desire in the vector
y <- 10 #desired sum of the vector
b <- 0 #initialize b (any value different from y works)
#until the sum of the elements is equal to the desired sum
while (b != y) {
a = sample(x,15,replace = TRUE) #calculate a random vector of 15 elements
b = sum(a) #sum of the elements
}
a #desired vector
I am trying to simulate an unlikely situation in a videogame using a Monte Carlo simulation. I'm extremely new at coding and thought this would be a fun situation to simulate.
There are 3 targets and they are being attacked 8 times independently. My problem comes with how to deal with the fact that one of the columns cannot be attacked more than 6 times, when there are 8 attacks.
I would like to take any attack aimed at column 2 select one of the other 2 columns at random to attack instead, but only if column 2 has been attacked 6 times already.
Here is my attempt to simulate with 5000 repeats, for example.
#determine number of repeats
trial <- 5000
#create matrix with a row for each trial
m <- matrix(0, nrow = trial, ncol = 3)
#The first for loop is for each row
#The second for loop runs each attack independently, sampling 1:3 at random, then adding one to that position of the row.
#The function that is called by ifelse() when m[trial, 2] > 6 = TRUE is the issue.
for (trial in 1:trial){
for (attack in 1:8) {
target <- sample(1:3, 1)
m[trial, target] <- m[trial, target] + 1
ifelse(m[trial, 2] > 6, #determines if the value of column 2 is greater than 6 after each attack
function(m){
m[trial, 2] <- m[trial, 2] - 1 #subtract the value from the second column to return it to 6
newtarget <- sample(c(1,3), 1) #select either column 1 or 3 as a new target at random
m[trial, newtarget] <- m[trial, newtarget] + 1 #add 1 to indicate the new target has been selected
m}, #return the matrix after modification
m) #do nothing if the value of the second column is <= 6
}
}
For example, if I have the matrix below:
> matrix(c(2,1,5,7,1,0), nrow = 2, ncol = 3)
     [,1] [,2] [,3]
[1,]    2    5    1
[2,]    1    7    0
I would like the function to look at the 2nd line of the matrix, subtract 1 from 7, and then add 1 to either column 1 or 3 to create c(2,6,0) or c(1,6,1). I would like to learn how to do this within the loop, but it could be done afterwards as well.
I think I am making a serious, fundamental error in how I use function(x) or ifelse.
Thank you.
Here's an improved version of your code:
set.seed(1)
trial <- 5000
#create matrix with a row for each trial
m <- matrix(0, nrow = trial, ncol = 3)
#The first for loop is for each row
#The second for loop runs each attack independently, sampling 1:3 at random, then adding one to that position of the row.
#An if block replaces the ifelse() call, which was the issue in the original code.
for (i in 1:trial){
for (attack in 1:8) {
target <- sample(1:3, 1)
m[i, target] <- m[i, target] + 1
#determines if the value of column 2 is greater than 6 after each attack
if(m[i, 2] > 6){
#subtract the value from the second column to return it to 6
m[i, 2] <- m[i, 2] - 1
#select either column 1 or 3 as a new target at random
newtarget <- sample(c(1,3), 1)
#add 1 to indicate the new target has been selected
m[i, newtarget] <- m[i, newtarget] + 1
}
}
}
# Notice the largest value in column 2 is no greater than 6.
apply(m, 2, max)
set.seed is used to make the results reproducible (usually just used for testing). The ifelse function has a different purpose than the normal if-else control flow. Here's an example:
x = runif(100)
ifelse(x < 0.5, 0, x)
You'll notice any element in x that is less than 0.5 is now zero. I changed your code to use an if block. Notice that m[i, 2] > 6 returns a single TRUE or FALSE, whereas in the small example above x < 0.5 returns a vector of logicals. So ifelse can take a vector of logicals, but the if block requires a single logical.
You were on the right track with using function, but it just isn't necessary in this case. Often, but not always, you'll define a function like this:
f = function(x)
x^2
But calling the function only returns a value; it does not modify the variable you pass in:
x = 5
f(x) # 25
x # still 5
For more on this, look up function scope in R.
Lastly, I changed the loop to be i in 1:trial instead of trial in 1:trial. You probably wouldn't notice any issues in your case, but it is better practice to use a loop variable that is separate from the variable defining the range of the loop.
Hope this helps.
P.S. R isn't really known for its speed when looping. If you want to make things go faster, you'll typically need to vectorize your code.
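As a rough illustration of what vectorizing could look like here, the sketch below draws all attacks at once, counts the hits per target, and then redirects any hits on column 2 beyond the cap. It assumes that each excess attack is redirected to column 1 or 3 with equal probability, independently of the others, which is what the loop does; the names n_attacks, cap, excess and to_first are introduced only for this example.

```r
set.seed(1)
trial     <- 5000
n_attacks <- 8
cap       <- 6    # column 2 cannot be attacked more than 6 times

# One row per trial, one column per attack
targets <- matrix(sample(1:3, trial * n_attacks, replace = TRUE), nrow = trial)

# Count how often each target was chosen in each trial
m <- cbind(rowSums(targets == 1),
           rowSums(targets == 2),
           rowSums(targets == 3))

# Attacks on column 2 beyond the cap are redirected to column 1 or 3
excess   <- pmax(m[, 2] - cap, 0)
to_first <- rbinom(trial, size = excess, prob = 0.5)  # how many go to column 1

m[, 2] <- m[, 2] - excess
m[, 1] <- m[, 1] + to_first
m[, 3] <- m[, 3] + (excess - to_first)

apply(m, 2, max)
```

Every row still sums to 8 attacks, and column 2 never exceeds the cap.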
I have a large mxn matrix, and I have identified the linearly dependent columns. However, I want to know if there's a way in R to write the linearly dependent columns in terms of the linearly independent ones. Since it's a large matrix, it's not possible to do based on inspection.
Here's a toy example of the type of matrix I have.
> mat <- matrix(c(1,1,0,1,0,1,1,0,0,1,1,0,1,1,0,1,0,1,0,1), byrow=TRUE, ncol=5, nrow=4)
> mat
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    1    0    1    0
[2,]    1    1    0    0    1
[3,]    1    0    1    1    0
[4,]    1    0    1    0    1
Here it's obvious that x3 = x1-x2, x5=x1-x4. I want to know if there's an automated way to get that for a larger matrix.
Thanks!
I'm sure there is a better way but I felt like playing around with this. I basically do a check at the beginning to see if the input matrix is of full column rank, to avoid unnecessary computation in that case. After that I start with the first two columns and check if that submatrix is of full column rank; if it is, I check the first three columns, and so on. Once we find a submatrix that isn't of full column rank, I regress the last column of that submatrix on the previous (linearly independent) columns, which tells us how to construct a linear combination of those columns to get the last column.
My function isn't very clean right now and could do some additional checking but at least it's a start.
mat <- matrix(c(1,1,0,1,0,1,1,0,0,1,1,0,1,1,0,1,0,1,0,1), byrow=TRUE, ncol=5, nrow=4)
linfinder <- function(mat){
# If the matrix is full rank then we're done
if(qr(mat)$rank == ncol(mat)){
print("Matrix is of full rank")
return(invisible(seq(ncol(mat))))
}
m <- ncol(mat)
# cols keeps track of which columns are linearly independent
cols <- 1
for(i in seq(2, m)){
ids <- c(cols, i)
mymat <- mat[, ids]
if(qr(mymat)$rank != length(ids)){
# Regress the column of interest on the previous
# columns to figure out the relationship
o <- lm(mat[,i] ~ mat[,cols] + 0)
# Construct the output message
start <- paste0("Column_", i, " = ")
# Which coefs are nonzero
nz <- !(abs(coef(o)) <= .Machine$double.eps^0.5)
tmp <- paste("Column", cols[nz], sep = "_")
vals <- paste(coef(o)[nz], tmp, sep = "*", collapse = " + ")
message <- paste0(start, vals)
print(message)
}else{
# If the matrix subset was of full rank
# then the newest column is linearly independent
# so add it to the cols list
cols <- ids
}
}
return(invisible(cols))
}
linfinder(mat)
which gives
> linfinder(mat)
[1] "Column_3 = 1*Column_1 + -1*Column_2"
[1] "Column_5 = 1*Column_1 + -1*Column_4"
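A more compact variant might lean on the pivoting information that qr() already returns, rather than testing one column at a time. The sketch below is an alternative to linfinder, not a rewrite of it: it assumes the matrix is rank deficient, takes the first rank entries of the pivot vector as the independent columns, and solves for the remaining columns in one call. The names indep, dep and coefs are introduced only for this example.

```r
mat <- matrix(c(1,1,0,1,0,1,1,0,0,1,1,0,1,1,0,1,0,1,0,1),
              byrow = TRUE, ncol = 5, nrow = 4)

q <- qr(mat)
indep <- q$pivot[seq_len(q$rank)]    # columns kept as linearly independent
dep   <- q$pivot[-seq_len(q$rank)]   # columns expressible via the others

# Least-squares solve: each column of coefs expresses one dependent column
# as a combination of the independent columns
coefs <- qr.solve(mat[, indep, drop = FALSE], mat[, dep, drop = FALSE])
dimnames(coefs) <- list(paste0("Column_", indep), paste0("Column_", dep))

round(coefs, 10)
```

Each column of coefs holds the weights for one dependent column; entries that round to zero mean the corresponding independent column is not needed.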