Optimization of speed in R

I am currently working on a function that operates on a big two-column matrix (more than 2,000 values in general) and have a speed problem.
Here is the head of my matrix:
(matrix output not shown)
Here is my function:
get <- function()
{
  v <- sample(1:1e6, 20000, replace = TRUE) # for example
  table <- # mymatrix
  for (i in 1:nrow(table))
  {
    b <- which(v > table[i,1] & v < table[i,2]) # want index between 2 intervals
  }
  return(b)
}
The problem is the which: it takes too long when I repeat the loop over the whole table, and I can't figure out how to fix it (I'm still learning R).

As Andrey said in a comment, you’re only returning the result for the last row. You’re also not passing table into the function (in fact, your function has no arguments), and it’s also unclear what v represents and in particular why it has more values than table has rows.
However, assuming that you want the results for all rows, you can do two things:
Don’t use which, you probably don’t need numeric indices.
Use vectorisation instead of a for loop:
get = function(table) {
  v = sample(1 : 1E6, 20000, replace = TRUE)
  v > table[, 1] & v < table[, 2]
}
That’s it.

Here is code that, for every value in vector v, tells you which of the bins it fell into.
tbl = matrix(c(  0, 224,
               225, 233,
               234, 239,
               240, 243,
               244, 290,
               291, 292),
             byrow = TRUE,
             ncol = 2)
v = c(  0, 100, 224,
      225, 230, 233,
      235)
# assuming integer values, as in the question
fi1 = findInterval(v, tbl[, 1] + 1)  # bin whose open lower bound v has passed
fi2 = findInterval(v, tbl[, 2])      # last bin whose upper bound v has reached
set = (fi1 == fi2 + 1)               # TRUE when v is strictly inside bin fi1
b = double(length(v))
b[set] = fi1[set]
# show the results
cbind(value = v, bin = b)
#      value bin
# [1,]     0   0
# [2,]   100   1
# [3,]   224   0
# [4,]   225   0
# [5,]   230   2
# [6,]   233   0
# [7,]   235   3
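A possible follow-up (my addition): from b you can recover, for each bin, the indices of v that fell into it, which is what the question's which() computed row by row:
idx_by_bin <- split(seq_along(v), b)  # list names are bin numbers
idx_by_bin[names(idx_by_bin) != "0"] # the "0" group holds values outside every bin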

Related

Generating "Non-Random" Numbers in R?

I know how to generate 100 random numbers in R (without replacement):
random_numbers = sample.int(100, 100, replace = FALSE)
I was now curious about learning how to generate 100 "non-random" numbers (without replacement). The first idea that comes to mind is to generate a random number, and then make the next number the old number + 1 with probability 0.5, or an actual random number with probability 0.5. Thus, these numbers are not "fully random".
This was my attempt to write this code for numbers in a range of 0 to 100 (suppose I want to repeat this procedure 100 times):
library(dplyr)
all_games <- vector("list", 100)
for (i in 1:100){
  index_i = i
  guess_sets <- 1:100
  prob_i = runif(n=1, min=1e-12, max=.9999999999)
  guess_i = ifelse(prob_i > 0.5, sample.int(1, 100, replace = FALSE), guess_i + 1)
  guess_sets_i <- setdiff(guess_sets_i, guess_i)
  all_games_i = as.list(index_i, guess_i, all_games_i)
  all_games[[i]] <- all_games_i
}
all_games <- do.call("rbind", all_games)
I tried to make a list that stores all guesses such that the range for the next guess automatically excludes numbers that have already been guessed, but I get this error:
Error in sample.int(1, 100, replace = FALSE) :
cannot take a sample larger than the population when 'replace = FALSE'
Ideally, I am trying to get the following results (format doesn't matter):
index_1 : 5,6,51,4,3,88,87,9 ...
index_2 77,78,79,2,65,3,1,99,100,4...
etc.
Can someone please show me how to do this? Are there easier ways in R to generate "non-random numbers"?
Thank you!
Note: I think an extra line of logic needs to be added. Suppose I guess the number 100; after that I must guess a new random number, since 100 + 1 is not in the original range. Also, suppose I guess 5, then 17, then 4; after guessing 4, the loop tells me to guess 4 + 1, which is impossible because 5 has already been guessed. In such a case, I would also have to guess a new random number?
It would be tricky to make your algorithm very efficient in R... it doesn't lend itself nicely to vectorization. Here's how I'd write it directly as a for loop:
semirandom = function(n) {
  safe_sample = function(x, ...) {
    if (length(x) == 1) return(x)
    sample(x, ...)
  }
  result = numeric(n)
  result[1] = sample.int(n, size = 1)
  for (i in 2:length(result)) {
    if (runif(1) < .5 &&
        result[i - 1] < n &&
        !((result[i - 1] + 1) %in% result)) {
      result[i] = result[i - 1] + 1
    } else {
      result[i] = safe_sample(x = setdiff(1:n, result), size = 1)
    }
  }
  result
}
# generate 10 semirandom numbers 5 times
replicate(semirandom(10), n = 5)
#       [,1] [,2] [,3] [,4] [,5]
#  [1,]    6    4    4    2    6
#  [2,]    3    5    5    3    7
#  [3,]    4    3    6    4    5
#  [4,]    5    1    2    5    2
#  [5,]    7    9    3    6    3
#  [6,]    9   10   10    1    1
#  [7,]   10    2    8    9    4
#  [8,]    2    8    1    8   10
#  [9,]    1    7    9   10    9
# [10,]    8    6    7    7    8
You get the error cannot take a sample larger than the population when 'replace = FALSE' because you attempt to extract 100 values from a vector of length one without replacement.
The following draws numbers between 1 and 100, drawing each number at most once. If the previous number + 1 has not been drawn yet, there is a 50 percent chance of drawing the previous number + 1 and a 50 percent chance of drawing another random number; if the previous number + 1 has already been drawn, another random number is drawn with certainty.
# guard against sample()'s scalar surprise: sample(x, 1) on a single number x
# draws from 1:x, which matters once only one number is left in the pool
resample <- function(pool) if (length(pool) == 1L) pool else sample(pool, 1L)
i <- sample.int(100, 1)
j <- i
for(x in 1:99) {
  if((i + 1L) %in% j) {
    i <- resample((1:100)[-j])
  } else {
    if(runif(1L) > 0.5 || i == 100L) {
      i <- resample((1:100)[-j])
    } else {
      i <- i + 1L
    }
  }
  j <- c(j, i)
}
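To repeat the draw 100 times as the question asks, here is a compact sketch (my addition; draw_semirandom is a name I made up) that wraps the same logic in a function and calls it via replicate:
draw_semirandom <- function(n = 100) {
  resample <- function(pool) if (length(pool) == 1L) pool else sample(pool, 1L)
  i <- sample.int(n, 1)
  j <- i
  for (x in 1:(n - 1)) {
    # draw randomly if i + 1 is taken, i is at the top, or the coin says so
    if ((i + 1L) %in% j || i == n || runif(1L) > 0.5) i <- resample((1:n)[-j])
    else i <- i + 1L
    j <- c(j, i)
  }
  j
}
all_games <- replicate(100, draw_semirandom())  # one permutation per column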

R: Logical Conditions Not Being Respected

I am working with the R programming language. I am trying to build a loop that performs the following :
Step 1: Keep generating two random numbers "a" and "b" until both "a" and "b" are greater than 12
Step 2: Track how many random numbers had to be generated until it took for Step 1 to be completed
Step 3: Repeat Step 1 and Step 2 100 times
Since I do not know how to keep generating random numbers until a condition is met, I tried to generate a large amount of random numbers hoping that the condition is met (there is probably a better way to write this):
results <- list()
for (i in 1:100){
  # do until break
  repeat {
    # repeat many random numbers
    a = rnorm(10000,10,1)
    b = rnorm(10000,10,1)
    # does any pair meet the requirement
    if (any(a > 12 & b > 12)) {
      # put it in a data.frame
      d_i = data.frame(a,b)
      # end repeat
      break
    }
  }
  # select all rows until the first time the requirement is met
  # it must be met, otherwise the loop would not have ended
  d_i <- d_i[1:which(d_i$a > 10 & d_i$b > 10)[1], ]
  # prep other variables and only keep last row (i.e. the row where the condition was met)
  d_i$index = seq_len(nrow(d_i))
  d_i$iteration = as.factor(i)
  e_i = d_i[nrow(d_i),]
  results[[i]] <- e_i
}
results_df <- do.call(rbind.data.frame, results)
results_df <- do.call(rbind.data.frame, results)
Problem: When I look at the results, I noticed that the loop is incorrectly considering the condition to be met, for example:
head(results_df)
          a        b index iteration
4  10.29053 10.56263     4         1
5  10.95308 10.32236     5         2
3  10.74808 10.50135     3         3
13 11.87705 10.75067    13         4
1  10.17850 10.58678     1         5
14 10.14741 11.07238     1         6
For instance, in each one of these rows - both "a" and "b" are smaller than 12.
Does anyone know why this is happening and can someone please show me how to fix this problem?
Thanks!
The reason the condition looks ignored is the subsetting line d_i[1:which(d_i$a > 10 & d_i$b > 10)[1], ]: it tests against 10 while the requirement is 12, so the first row where both values are merely above 10 is kept. How about this way instead? As you tagged while-loop, I tried using it.
res <- matrix(0, nrow = 0, ncol = 3)
for (j in 1:100){
  a <- rnorm(1, 10, 1)
  b <- rnorm(1, 10, 1)
  i <- 1
  while(a < 12 | b < 12) {
    a <- rnorm(1, 10, 1)
    b <- rnorm(1, 10, 1)
    i <- i + 1
  }
  x <- c(a,b,i)
  res <- rbind(res, x)
}
head(res)
      [,1]     [,2] [,3]
x 12.14232 12.08977  399
x 12.27158 12.01319 1695
x 12.57345 12.42135  302
x 12.07494 12.64841  600
x 12.03210 12.07949   82
x 12.34006 12.00365  782
dim(res)
[1] 100 3
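As an aside (my addition, not part of the answer above): because every (a, b) pair is an independent trial with success probability p = P(a > 12) * P(b > 12), the trial counts follow a geometric distribution and can be simulated with no loop at all, and the successful pair follows a normal distribution truncated at 12:
p <- (1 - pnorm(12, mean = 10, sd = 1))^2  # P(a > 12 & b > 12) for one pair
count <- rgeom(100, p) + 1                 # rgeom counts failures, so add the success
lo <- pnorm(12, 10, 1)                     # inverse-CDF sampling of N(10, 1) given > 12
a <- qnorm(runif(100, lo, 1), 10, 1)
b <- qnorm(runif(100, lo, 1), 10, 1)
head(data.frame(a, b, count))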

How can I replace a nested loop using lapply in R?

Good afternoon,
I have developed this R function that hashes data into buckets:
# The used packages
library("pacman")
pacman::p_load(dplyr, tidyr, devtools, MASS, pracma, mvtnorm, interval, intervals)
pacman::p_load(sprof, RDocumentation, helpRFunctions, foreach, philentropy, Rcpp, RcppAlgos)

hash <- function(v, p) {
  if (dot(v, p) > 0) return(1) else return(0)
}

LSH_Band <- function(data, K) {
  # We retrieve numerical columns of data
  t <- list.df.var.types(data)
  df.r <- as.matrix(data[c(t$numeric, t$Intervals)])
  n = nrow(df.r)
  # we create a K*K matrix using the normal law
  rn = array(rnorm(K*K, 0, 1), c(K, K))
  # we create a K*K matrix of integers using the uniform law; integers are unique in each column
  rd = unique.array(array(unique(ceiling(runif(K*K, 0, ncol(df.r)))), c(K, K)))
  buckets <- array(NA, c(K, n))
  for (i in 1:K) {
    for (j in 1:n) {
      buckets[i, j] <- hash(df.r[j, ][rd[, i]], rn[, i])
    }
  }
  return(buckets)
}
> df.r
  age height salaire.1 salaire.2
1  27    180         0      5000
2  26    178         0      5000
3  30    190      7000     10000
4  31    185      7000     10000
5  31    187      7000     10000
6  38    160     10000     15000
7  39    158     10000     15000
> LSH_Band(df.r, 3)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]    1    1    1    1    1    1    1
[2,]    1    1    0    0    0    0    0
[3,]    0    0    0    0    0    0    0
The dot function is the scalar product of two vectors.
My LSH function takes a row of my data, then takes a part of that row using df.r[j,][rd[,i]]; df.r[j,] is the j-th row of the data.
rd[,i]: rd is a K*K matrix of integers between 1 and ncol(df.r); each column of the matrix contains only unique integers.
rn[,i]: rn is a K*K matrix that contains draws from the N(0,1) law.
In the resulting table, observations are represented in columns, and I will have K rows. For the last row, I compute the scalar product between df.r[j,][rd[,K]] and rn[,K], and I obtain 1 if the scalar product is positive. rd[,K] and rn[,K] are used only for the last row of the resulting table, for all observations.
My question:
Is it possible to replace the loops over i and j with an lapply call?
My real data will be large; this is why I'm asking this question.
Thank you!
The following is a bit too long as a comment, so here are some pointers/issues/remarks:
First off, I have to say I struggle to understand what LSH_Band does. Perhaps some context would help here.
I don't understand the purpose of certain functions like helpRFunctions::list.df.var.types, which simply seems to return the column names of data in a list. Note also that t$Intervals returns NULL based on the sample data you give. So I'm not sure what's going on there.
I don't see the point of function pracma::dot either. The dot product between two vectors can be calculated in base R using %*%. There's really no need for an additional package.
Function hash can be written more compactly as
hash <- function(v, p) +(as.numeric(v %*% p) > 0)
This avoids the if conditional, which is slow.
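For instance (a quick check I added):
hash(c(1, 2, 3), c(1, 0, -1))  # dot product is 1 - 3 = -2, so this returns 0
hash(c(1, 2, 3), c(1, 0,  1))  # dot product is 1 + 3 =  4, so this returns 1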
Notwithstanding my lack of understanding of what it is you're trying to do, here are some tweaks to your code:
hash <- function(v, p) +(as.numeric(v %*% p) > 0)

LSH_Band <- function(data, K, seed = NULL) {
  # We retrieve numerical columns of data
  data <- as.matrix(data[sapply(data, is.numeric)])
  # we create a K*K matrix using the normal law
  if (!is.null(seed)) set.seed(seed)
  rn <- matrix(rnorm(K * K, 0, 1), nrow = K, ncol = K)
  # we create a K*K matrix of integers, unique within each column
  rd <- sapply(seq_len(K), function(col) sample.int(ncol(data), K))
  buckets <- matrix(NA, nrow = K, ncol = nrow(data))
  for (i in 1:K) {
    buckets[i, ] <- apply(data, 1, function(row) hash(row[rd[, i]], rn[, i]))
  }
  buckets
}
Always add an option to use a reproducible seed when working with random numbers. That will make debugging a lot easier.
You can replace at least one for loop with apply (which when using MARGIN = 1 iterates through the rows of a matrix (or array)).
I've removed all the unnecessary package dependencies, and replaced the functionality with base R functions.
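A usage sketch (my addition) on the sample data from the question; the seed argument makes the buckets reproducible:
df.r <- data.frame(age       = c(27, 26, 30, 31, 31, 38, 39),
                   height    = c(180, 178, 190, 185, 187, 160, 158),
                   salaire.1 = c(0, 0, 7000, 7000, 7000, 10000, 10000),
                   salaire.2 = c(5000, 5000, 10000, 10000, 10000, 15000, 15000))
LSH_Band(df.r, K = 3, seed = 42)  # a 3 x 7 matrix of 0/1 bucket bits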

Memorize the last "correct" value of a sequence (for removing outliers)

I have a little problem with a function.
Its aim is to remove the outliers I've detected in my data.frame. An outlier is detected when there is too big a difference with the previous correct value (e.g. in c(1,2,3,20,30,4,5,6), "20" and "30" are the outliers). But my real data is much more complex than this.
My idea is to consider the first two numeric values of my column as "correct". Then, I want to test each next value:
if the difference between the tested value and the previous one is <20, then it's a new correct one, and the test must start again from this new correct value (and not from the previous correct one)
if the same difference is >20, then it's a wrong one. An index must be put next to the wrong value, and the test must still continue from this same correct value, until a new correct value is detected
Here's an example with my function and a fake DF:
myts <- data.frame(x = c(12,12,35,39,46,45,33,5,26,28,29,34,15,15), z = NA)
test <- function(x){
  st1 = NULL
  temp <- st1[1] <- x[1]
  st1 <- numeric(length(x))
  for (i in 2:(length(x))){
    if((!is.na(x[i])) & (!is.na(x[i-1])) & (abs((x[i])-(temp)) > 20)){
      st1[i] <- 1
    }
  }
  return(st1)
}
myts[,2] <- apply(as.data.frame(myts[,1]), 2, test)
myts[,2] <- as.numeric(myts[,2])
It nearly does the job, but the problem is that the last correct value is not memorized: the test keeps comparing against the first correct value.
Because of this, rows 9 to 11 in my example are not detected. I'll let you imagine the problem on a 500,000-row data.frame.
How can I solve this little problem? The rest of the function may be OK.
You just need to update temp for any indices that aren't outliers:
test <- function(x) {
  temp <- x[1]
  st1 <- numeric(length(x))
  for (i in 2:(length(x))){
    if(!is.na(x[i]) & !is.na(x[i-1]) & abs(x[i]-temp) > 20) {
      st1[i] <- 1
    } else {
      temp <- x[i]
    }
  }
  return(st1)
}
myts[,2] <- apply(as.data.frame(myts[,1]),2,test)
myts[,2] <- as.numeric(myts[,2])
myts
#     x z
# 1  12 0
# 2  12 0
# 3  35 1
# 4  39 1
# 5  46 1
# 6  45 1
# 7  33 1
# 8   5 0
# 9  26 1
# 10 28 1
# 11 29 1
# 12 34 1
# 13 15 0
# 14 15 0
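A small aside (my addition): since test operates on a single vector, the apply scaffolding isn't needed to fill the column:
# equivalent, without the apply scaffolding
myts$z <- test(myts$x)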
One thing to note is that for loops in R will be quite slow compared to vectorized functions. However, because each element in your vector depends in a complicated way on the previous ones, it's tough to use R's built-in vectorized functions to compute your vector efficiently. You can convert this code nearly verbatim to C++ and use the Rcpp package to regain the efficiency:
library(Rcpp)
test2 <- cppFunction(
  "IntegerVector test2(NumericVector x) {
    const int n = x.length();
    IntegerVector st1(n, 0);
    double temp = x[0];
    for (int i = 1; i < n; ++i) {
      if (!R_IsNA(x[i]) && !R_IsNA(x[i - 1]) && fabs(x[i] - temp) > 20.0) {
        st1[i] = 1;
      } else {
        temp = x[i];
      }
    }
    return st1;
  }")
all.equal(test(myts[,1]), test2(myts[,1]))
# [1] TRUE
# Benchmark on large vector with some NA values:
set.seed(144)
large.vec <- c(0, sample(c(1:50, NA), 1000000, replace=T))
all.equal(test(large.vec), test2(large.vec))
# [1] TRUE
library(microbenchmark)
microbenchmark(test(large.vec), test2(large.vec))
# Unit: milliseconds
#              expr         min          lq       mean     median         uq        max neval
#   test(large.vec) 2343.684164 2468.873079 2667.67970 2604.22954 2747.23919 3753.54901   100
#  test2(large.vec)    9.596752    9.864069   10.97127   10.23011   11.68708   16.67855   100
The Rcpp code is about 250x faster on a vector of length 1 million. Depending on your use case this speedup may or may not be important.

R: Vectorize Finite Difference Equations

I'm trying to move some Fortran code to R for finite differences related to chemical kinetics.
Sample Fortran loop:
DOUBLE PRECISION, DIMENSION (2000,2) :: data=0.0
DOUBLE PRECISION :: k1=5.0, k2=20.0, dt=0.0005
DO i=2, 2000
  data(i,1) = data(i-1,1) + data(i-1,1)*(-k1)*dt
  data(i,2) = data(i-1,2) + ( data(i-1,1)*k1*dt - data(i-1,2)*k2*dt )
  ...
END DO
The analogous R code:
k1 = 5
k2 = 20
dt = 0.0005
data = data.frame(cbind(c(500, rep(0, 1999)), rep(0, 2000)))
a.fun = function(y){
  y2 = y - k1*y*dt
  return(y2)
}
apply(data, 2, a.fun)
This overwrites my first value in the dataframe and leaves zeros elsewhere. I'd like to run this vectorized and not using a for loop since they are so slow in R. Also, my function only calculates the first column so far. I can't get the second column working until I get the syntax right on the first.
It's not necessarily true that R is bad at loops. It very much depends on what you are doing. Using k1, k2, dt and data from the question (i.e. the four lines beginning with k1 = 5) and formulating the problem in terms of an iterated matrix, the loop in the last line below returns nearly instantaneously on my PC:
z <- as.matrix(data)
m <- matrix(c(1-k1*dt, k1*dt, 0, 1-k2*dt), 2)
for(i in 2:nrow(z)) z[i, ] <- m %*% z[i-1, ]
(You could also try storing the vectors in columns of z rather than rows since R stores matrices by column.)
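For illustration, here is a minimal sketch of that column-wise variant (my addition):
zt <- t(as.matrix(data))  # one state vector per column, one column per time step
for(i in 2:ncol(zt)) zt[, i] <- m %*% zt[, i - 1]
# t(zt) matches z from the row-wise loop above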
Here is the first bit of the result:
> head(z)
           X1       X2
[1,] 500.0000 0.000000
[2,] 498.7500 1.250000
[3,] 497.5031 2.484375
[4,] 496.2594 3.703289
[5,] 495.0187 4.906905
[6,] 493.7812 6.095382
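Since the update is linear, the loop can even be removed entirely: z[i, ] equals m^(i-1) %*% z[1, ]. Here is a sketch (my addition; it assumes m is diagonalizable, which holds here because its eigenvalues 1 - k1*dt and 1 - k2*dt are distinct):
e <- eigen(m)
coef <- solve(e$vectors, z[1, ])             # z[1, ] expressed in the eigenbasis
pw <- outer(e$values, 0:(nrow(z) - 1), `^`)  # eigenvalue powers, one column per step
z_vec <- t(e$vectors %*% (pw * coef))        # row i equals m^(i-1) %*% z[1, ]
all.equal(z, z_vec, check.attributes = FALSE)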
Maybe this can help.
I think you also need the initial condition for data[1,2]. I assumed data[1,1] = 500 and data[1,2] = 0 as the initial condition.
The code goes like this:
> ## Define two vectors x and y, filled with zeros
> x <- numeric(2000)
> y <- numeric(2000)
>
> ## Constants
> k1 = 5.0
> dt = 0.0005
> k2 = 20.0
>
> ## Initialize x[1] = 500 and y[1] = 0
> x[1] = 500
> y[1] = 0
>
> for (i in 2:2000){
+   x[i] = x[i-1] + x[i-1]*-k1*dt
+   y[i] = y[i-1] + x[i-1]*k1*dt - y[i-1]*k2*dt
+ }
>
> finaldata <- data.frame(x, y)
> head(finaldata)
         x        y
1 500.0000 0.000000
2 498.7500 1.250000
3 497.5031 2.484375
4 496.2594 3.703289
5 495.0187 4.906905
6 493.7812 6.095382
I hope this helps.
