How to optimize an iterative function for big data analysis? - r

I have a problem optimizing a model. My function increments the value of a variable (Dem) in an iterative process until the condition in the while loop is met.
I had to use for loops and some if statements; I know that makes processing very slow in R, but I have to do it in R.

The variable P has a length of 10958 observations; the variables Area and C each have a length of 65511 observations.
Using system.time(myfunction) with only one element, Area[1] and C[1], my computer takes 2.5 seconds to complete the process, so running all elements of Area and C would take about 45 hours.
My professor says that is too slow, but I think it is normal for this amount of data. Is there a way to optimize it? Should I use an optimizer (PSO, DEoptim, etc.) instead of the while loop?
myfunction <- function(P, Area, C, Cap, Inc) {
  Vin     <- Cap
  Q       <- NA
  Ov      <- NA
  Def     <- NA
  Vf      <- NA
  Vp      <- NA
  Dem     <- 0
  Dem_100 <- NA
  Fail    <- 0
  for (i in 1:length(Area)) {
    while (Fail == 0) {
      Dem <- Dem + Inc
      for (j in 1:length(P)) {
        # ---- Calculate Q ----
        if (P[j] == 0) {
          Q <- 0
        } else {
          Q <- P[j] * Area[i] * C[i]
        }
        # ---- Calculate Vp ----
        Vp <- (Vin + Q) - Dem
        if (Vp < 0) {
          Fail <- 1
          break   # stop the for over j and continue the while
        }
        # ---- Calculate Ov ----
        if (Vp > Cap) {
          Ov <- Vp - Cap
        } else {
          Ov <- 0
        }
        # ---- Calculate Def ----
        if (Vp < 0) {
          Def <- 0 - Vp
        } else {
          Def <- 0
        }
        # ---- Calculate Vf ----
        if (Vp > Cap) {
          Vf <- Cap
        } else {
          if (Vp < 0) {
            Vf <- 0
          } else {
            Vf <- Vp
          }
        }
        # ---- Update Vin ----
        Vin <- Vf
      }
      Vin <- Cap   # reset Vin for the next pass over j
    }
    Dem_100[i] <- Dem - Inc
    Def  <- NA
    Dem  <- 0
    Vin  <- Cap
    Fail <- 0
  }
  return(list(DemGar100 = Dem_100))
}
Test of the processing time:
P<-abs(rnorm(10958))
system.time(myfunction(P = P,Area = 100,C = 0.8,Cap = 10000,Inc = 1))
user system elapsed
2.45 0.00 2.50
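Scaling that single-element timing up to all 65511 elements of Area and C reproduces the 45-hour estimate:
2.5 * 65511 / 3600   # seconds per element times number of elements, in hours: roughly 45.5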

I don't have enough rep to comment, but since this isn't a full answer, that's where it should really go.
Have you thought about replacing some of the ifs with ifelse? That should speed things up.
E.g., you could replace the whole j loop with something like:
for (i in 1:length(Area)){
  while(Fail==0){
    Dem <- Dem + Inc
    Q <- ifelse(P==0, 0, P*Area[i]*C[i])  ## note Q is a vector of length length(P)
    ...
Also, I think Def is never actually calculated (and Vf is always just Vp when Vp <= Cap), because when Vp < 0 you jump out of the j loop and maybe even out of the while (you set Fail to 1, but I do not know when R checks the condition: at the end of the cycle or at the beginning?).
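Building on that suggestion, here is a rough sketch of the whole function with Q hoisted out of the while loop, since Q depends only on i and not on Dem (untested on the real data, so treat it as a starting point rather than a drop-in replacement; myfunction2 is just an illustrative name):
myfunction2 <- function(P, Area, C, Cap, Inc) {
  Dem_100 <- numeric(length(Area))
  for (i in seq_along(Area)) {
    # Q does not change while Dem is incremented, so compute it once per Area[i]/C[i]
    Q <- ifelse(P == 0, 0, P * Area[i] * C[i])   # vector of length(P)
    Dem <- 0
    repeat {
      Dem <- Dem + Inc
      Vin <- Cap
      failed <- FALSE
      for (j in seq_along(P)) {
        Vp <- Vin + Q[j] - Dem
        if (Vp < 0) { failed <- TRUE; break }
        Vin <- min(Vp, Cap)   # when Vp >= 0, the Vf logic reduces to min(Vp, Cap)
      }
      if (failed) break       # this Dem fails, so the previous one is the answer
    }
    Dem_100[i] <- Dem - Inc
  }
  list(DemGar100 = Dem_100)
}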

Related

Outlier for function runtime in R

I am trying to look at system runtime for computationally heavy math functions, but when I run my code I end up with an outlier at n = 13.
Wilson's Theorem runtime in R
(I can't upload photos directly yet.)
wilson_r <- (function(x) factorial(x-1)%%x==x-1)
r_wilson_runtime <- c(1:22)
# R cannot compute wilson_r(23) or anything with n > 22, as R has a 64-bit limit and log2(23!) > 64.
for (x in c(1:22)){
  holder_times <- c(1:10000)
  for (y in c(1:10000)){
    start_time <- as.numeric(Sys.time())
    wilson_r(x)
    end_time <- as.numeric(Sys.time())
    holder_times[y] <- end_time - start_time
  }
  r_wilson_runtime[x] <- mean(holder_times*(10**6))
}
I have tried knitting the document several times, and the outlier remains. Is there a particular reason for the outlier?
The result can sometimes be noisy. If it always happens at the same n, make sure knitr is regenerating the whole document; otherwise it is just a coincidence. You can easily get rid of the noise (outlying measurements) in your example by taking the median rather than the mean.
That said, R has a special function, system.time, which is designed for measuring execution time. It is also better to include the inner repetition loop inside the measurement, like this:
wilson_r <- (function(x) factorial(x-1)%%x==x-1)
r_wilson_n = 1:22
r_wilson_runtime = sapply(r_wilson_n, function(x) {
  N = 100000
  ret = system.time({for (y in c(1:N)) wilson_r(x)})
  1e6 * ret[1] / N
})
plot(r_wilson_n, r_wilson_runtime)
Nevertheless, the result can still be somewhat noisy for such cheap functions (R is a language with automatic garbage collection).
As for your wilson_r for higher n, it is not a good idea to work with large integers if you only take the modulo at the end; it is better to take the modulo at every multiplication. You can use the inline package to make a small C function that calculates this efficiently:
factorial_modulo = inline::cfunction(
signature(v="integer"),
" int n=v[0], ret=1, i;
for (i=2; i<n; i++)
ret = (ret * i) % n;
v[0] = ret;",
convention=".C")
wilson_r <- (function(x) factorial_modulo(x)==x-1)
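If you would rather stay in pure R, the same modulo-at-every-multiplication idea can be written without the inline dependency (a sketch only; factorial_modulo_r and wilson_r2 are made-up names here, and this will be slower than the compiled version):
factorial_modulo_r <- function(n) {
  ret <- 1
  if (n > 2) for (i in 2:(n - 1)) ret <- (ret * i) %% n   # keep every intermediate below n
  ret
}
wilson_r2 <- function(x) factorial_modulo_r(x) == x - 1
wilson_r2(23)   # TRUE, since 23 is prime and Wilson's theorem gives 22! mod 23 = 22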

Get out of infinite while loop

What is the best way to have a while loop recognize when it is stuck in an infinite loop in R?
Here's my situation:
diff_val = Inf
last_val = 0
while(diff_val > 0.1){
  ### calculate val from the data subset that is greater than the previous iteration's val
  val = foo(subset(data, col1 > last_val))
  diff_val = abs(val - last_val)  ### how much did this change val?
  last_val = val                  ### set last_val for the next iteration
}
The goal is to have val get progressively closer and closer to a stable value, and when val is within 0.1 of the val from the last iteration, then it is deemed sufficiently stable and is released from the while loop. My problem is that with some data sets, val gets stuck alternating back and forth between two values. For example, iterating back and forth between 27.0 and 27.7. Thus, it never stabilizes. How can I break the while loop if this occurs?
I know of break but do not know how to tell the loop when to use it. I imagine holding onto the value from two iterations before would work, but I do not know of a way to keep values from two iterations ago...
while(diff_val > 0.1){
  val = foo(subset(data, col1 > last_val))
  diff_val = abs(val - last_val)
  last_val = val
  if(val == val_2_iterations_ago) break
}
How can I create val_2_iterations_ago?
Apologies for the non-reproducible code. The real foo() and data that are needed to replicate the situation are not mine to share... they aren't key to figuring out this issue with control flow, though.
I don't know if just keeping track of the previous two iterations will actually suffice, but it isn't too much trouble to add logic for this.
The logic is that at each iteration, the second to last value becomes the last value, the last value becomes the current value, and the current value is derived from foo(). Consider this code:
# initialise before the loop: val_2_iterations_ago must exist before it is
# first compared (any value foo() cannot return will do)
diff_val <- Inf
last_val <- 0
val_2_iterations_ago <- -Inf

while (diff_val > 0.1) {
  val <- foo(subset(data, col1 > last_val))
  if (val == val_2_iterations_ago) break
  diff_val <- abs(val - last_val)
  val_2_iterations_ago <- last_val
  last_val <- val
}
Another approach, perhaps a little more general, would be to track your iterations and set a maximum.
Pairing this with Tim's nice answer:
iter = 0
max_iter = 1e6
while (diff_val > 0.1 & iter < max_iter) {
  val <- foo(subset(data, col1 > last_val))
  if (val == val_2_iterations_ago) break
  diff_val = abs(val - last_val)
  val_2_iterations_ago <- last_val
  last_val <- val
  iter = iter + 1
}
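If a two-value cycle turns out not to be the only failure mode, a slightly more general sketch is to keep the whole history of values and break as soon as one repeats (foo(), data and col1 are the question's objects; the tolerance and max_iter are illustrative):
history  <- numeric(0)
last_val <- 0
diff_val <- Inf
max_iter <- 1e4
while (diff_val > 0.1 && length(history) < max_iter) {
  val <- foo(subset(data, col1 > last_val))
  if (any(abs(val - history) < 1e-8)) break   # this value was seen before: a cycle
  diff_val <- abs(val - last_val)
  history  <- c(history, val)
  last_val <- val
}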
How this is generally done is that you have:
A convergence tolerance, so that when your objective function doesn't change appreciably, the algorithm is deemed to have converged
A limit on the number of iterations, so that the code is guaranteed to terminate eventually
A check that the objective function is actually decreasing, to catch the situation where it's diverging/cyclic (many optimisation algorithms are designed so this shouldn't happen, but in your case it does happen)
Pseudocode:
oldVal <- Inf
for (i in 1:NITERS) {
  val <- objective(x)
  diffVal <- val - oldVal
  converged <- (diffVal <= 0 && abs(diffVal) < TOL)
  if (converged || diffVal > 0)
    break
  oldVal <- val
}
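To make that pseudocode concrete, here is a runnable toy version (the objective, the update step, TOL and NITERS are all placeholders; substitute your own foo()-based update):
objective <- function(x) (x - 3)^2        # toy convex objective
TOL    <- 1e-6
NITERS <- 100
x      <- 10
oldVal <- Inf
for (i in 1:NITERS) {
  x <- x - 0.2 * (x - 3)                  # toy update step (gradient descent towards 3)
  val <- objective(x)
  diffVal <- val - oldVal
  converged <- (diffVal <= 0 && abs(diffVal) < TOL)
  if (converged || diffVal > 0) break     # stop on convergence or if the objective rises
  oldVal <- val
}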

Tail recursion in R

I seem to misunderstand tail recursion; according to this stackoverflow question R does not support tail recursion. However, let's consider the following functions to compute the nth fibonacci number:
Iterative version:
Fibo <- function(n){
  a <- 0
  b <- 1
  for (i in 1:n){
    temp <- b
    b <- a
    a <- a + temp
  }
  return(a)
}
"Naive" recursive version:
FiboRecur <- function(n){
  if (n == 0 || n == 1){
    return(n)
  } else {
    return(FiboRecur(n-1) + FiboRecur(n-2))
  }
}
And finally an example I found that should be tail call recursive:
FiboRecurTail <- function(n){
  fib_help <- function(a, b, n){
    if(n > 0){
      return(fib_help(b, a+b, n-1))
    } else {
      return(a)
    }
  }
  return(fib_help(0, 1, n))
}
Now if we take a look at the traces when these functions are called, here is what we get:
Fibo(25)
trace: Fibo(25)
[1] 75025
trace(FiboRecur)
FiboRecur(25)
Thousands of calls to FiboRecur and takes a lot of time to run
FiboRecurTail(25)
trace: FiboRecurTail(25)
[1] 75025
In the cases of Fibo(25) and FiboRecurTail(25), the answer is displayed instantaneously and only one call is made. For FiboRecur(25), thousands of calls are made and it runs for some seconds before showing the result.
We can also take a look at the run times using the benchmark function from the package rbenchmark:
benchmark(Fibo(30), FiboRecur(30), FiboRecurTail(30), replications = 5)
test replications elapsed relative user.self sys.self user.child sys.child
1 Fibo(30) 5 0.00 NA 0.000 0 0 0
2 FiboRecur(30) 5 13.79 NA 13.792 0 0 0
3 FiboRecurTail(30) 5 0.00 NA 0.000 0 0 0
So if R does not support tail recursion, what is happening in FiboRecurTail(25) that makes it run as fast as the iterative version while the "naive" recursive function runs like molasses? Is it rather that R supports tail recursion, but does not optimize a "naive" recursive version of a function to be tail-call recursive like other programming languages (Haskell for instance) do? This is what I understand from this post in R's mailing list.
I would greatly appreciate if someone would shed some light into this. Thanks!
The difference is that for each recursion, FiboRecur calls itself twice. Within FiboRecurTail, fib_help calls itself only once.
Thus you have a whole lot more function calls with the former. In the case of FiboRecurTail(25) you have a recursion depth of ~25 calls. FiboRecur(25) results in 242,785 function calls (including the first).
I didn't time any of the routines, but note that you show 0.00 for both of the faster routines. You should see some difference with a higher input value, but note that Fibo iterates exactly as much as FiboRecurTail recurses.
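One quick way to check that call count is to instrument the naive version with a global counter (a small sketch; FiboRecurCounted is just an illustrative name):
call_count <- 0
FiboRecurCounted <- function(n){
  call_count <<- call_count + 1
  if (n == 0 || n == 1) n else FiboRecurCounted(n - 1) + FiboRecurCounted(n - 2)
}
FiboRecurCounted(25)   # 75025
call_count             # 242785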
In the naive recursive approach, you repeatedly calculate a lot of values. For example, when you calculate FiboRecur(30) you will calculate FiboRecur(29) and FiboRecur(28), and these two calls are independent of each other. And in FiboRecur(29) you will calculate FiboRecur(28) again and FiboRecur(27), even though FiboRecur(28) has already been calculated elsewhere, as above. This happens at every stage of the recursion. Simply put, for every increase of n the calculation effort almost doubles, when in reality it should be as simple as adding the last two calculated numbers together.
A little summary of FiboRecur(4): FiboRecur(0) is calculated twice, FiboRecur(1) is calculated three times, FiboRecur(2) is calculated twice and FiboRecur(3) is calculated once. The former three should really be calculated once and stored somewhere so that you can extract the values whenever they are needed. And that's why you see so many function calls even though it's not a large number.
In the tail recursive version, however, every previously calculated value is passed on to the next stage via the a and b parameters, which avoids the countless repeated calculations of the naive recursive version and is therefore much more efficient.
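To illustrate the "store it somewhere" point, here is a minimal memoized version of the naive recursion with a cache kept in a local environment (FiboMemo is an illustrative name, not from the answers below):
FiboMemo <- local({
  cache <- c(0, 1)    # cache[k + 1] holds Fibonacci(k)
  function(n) {
    if (n + 1 <= length(cache)) return(cache[n + 1])
    res <- FiboMemo(n - 1) + FiboMemo(n - 2)
    cache[n + 1] <<- res   # remember the result for later calls
    res
  }
})
FiboMemo(25)   # 75025, with each Fibonacci number computed only once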
The following algorithm uses the accumulator parameter technique to make things tail recursive, and then wraps the result in a memoization function.
The number of function calls shouldn't necessarily differ for tail recursion. This is mostly about managing stack memory, not speed. Every call to fib(n) generates calls to fib(n - 1) and fib(n - 2), except that in tail-recursive cases the stack frame is reused rather than a new one being allocated for each call.
Memoization is what gives a speed-boost. Results are cached for future use.
library(hash)
library(purrr)   # assumed here for partial(), map() and the %>% pipe used below
# Generate Fibonacci numbers
# Tail recursive algorithm using the accumulator parameter technique
fibTR <- function(n) {
  fibLoop <- function(acc, m, k) {
    if (k == 0)
      acc
    else
      fibLoop(acc = m, m = acc + m, k = k - 1)
  }
  fibLoop(acc = 0, m = 1, k = n)
}
# A generic memoization function for a function fn taking integer input
memoize <- function(fn, inp) {
  cache <- hash::hash()
  key <- as.character(inp)
  if (hash::has.key(key = key, hash = cache))
    cache[[key]]
  else {
    cache[[key]] <- inp %>% fn
    cache[[key]]
  }
}
# Partial application of a function:
# a memoized and tail recursive Fibonacci number generator
fib <- partial(.f = memoize, fn = fibTR)
# Get the first 10 Fibonacci numbers
map(.x = 0:9, .f = fib) %>% unlist
Running fibTR(10000) yields
Error: C stack usage 15927040 is too close to the limit
So, I doubt R does efficient tail call optimization.
Another issue is the construction of the cache or lookaside table. In functional languages such as Haskell, ML, etc., that intermediary data structure gets built when you first partially apply the function. Assuming the same effect in R, there is a further problem: memory allocation in R is very expensive, and so is growing vectors, matrices, and the like. Here we are growing a dictionary, and if we wanted to pre-allocate a dictionary of the appropriate size we would have to supply the n argument, and the cache would be constructed every time we call the function, which defeats the purpose.
// Here is F# code to do the same:
// Generate Fibonacci numbers: tail recursive algorithm
let fibTR n =
    let rec fibLoop acc m k =
        match k with
        | 0 -> acc
        | n -> fibLoop m (acc + m) (n - 1)
    fibLoop 0 1 n

// A generic memoization function
let memoize (fn: 'T -> 'U) =
    let cache = new System.Collections.Generic.Dictionary<_, _>()
    fun inp ->
        match cache.TryGetValue inp with
        | true, res -> res
        | false, _ ->
            let res = inp |> fn
            cache.Add(inp, res)
            res

// A tail recursive and memoized Fibonacci generator
let fib = fibTR |> memoize

// Get the first 10 Fibonacci numbers
[ 0..9 ] |> List.map fib

How to create an efficient for loop to resolve the rate limit issue with twitteR?

I am quite new to twitteR and the concept of a for loop. I came across this code to get followers and their profiles.
The code below works fine, though I'm not entirely sure retryOnRateLimit should be set to such a large value.
#This extracts all or most followers.
followers<-getUser("twitter_handle_here")$getFollowerIDs(retryOnRateLimit=9999999)
The code below is the for loop to get the profiles.
However, I think there should be a way to use length(followers) and getCurRateLimitInfo() to construct the loop better.
My question is: if length(followers) = 40000 and the rate limit is 180, how should the loop be constructed so that it sleeps for the right amount of time and still gets all 40000 Twitter profiles?
Any help would be much appreciated.
#This is the for loop to sleep for 5 seconds.
#Problem with this is it simply sleeps for X seconds
for (follower in followers){
  Sys.sleep(5)
  followers_info <- lookupUsers(followers)
  followers_full <- twListToDF(followers_info)
}
Here is some code I wrote for a similar purpose. First you need to define this stall_rate_limit function:
stall_rate_limit <- function(limit) {
  # Store the record of all the rate limits into rate
  rate = getCurRateLimitInfo()
  message("Checking Rate Limit")
  if (any(as.numeric(rate[,3]) == 0)) {
    # Get the locations of API calls that are used up
    index = which(as.numeric(rate[,3]) == 0)
    # Get the time at which the rate limits reset
    wait = as.POSIXct(min(rate[index,4]),    ## reset times are in the 4th column
                      origin = "1970-01-01", ## origin of Unix time
                      tz = "US/Mountain")    ## replace with your timezone
    message(paste("Waiting until", wait, "for Godot to reset rate limit"))
    # Tell the computer to sleep until the rates reset
    Sys.sleep(difftime(wait, Sys.time(), units = "secs"))
    # Set J to 0 and return it as the new counter
    J = 0
    return(J)
  } else {
    # Count was off, try again
    J = limit - 1
    return(J)
  }
}
Then you can run your code something like this:
callsMade = 0 ## This is your counter to count how many calls were made
limit = 180 ## the Limit of how many calls you can make
for(i in 1:length(followers)){
  # Check to see if you have exceeded your limit
  if(callsMade >= limit){
    # If you have exceeded your limit, wait and set calls made to 0
    callsMade = stall_rate_limit(limit)
  }
  ### Execute your code here ... ###
  callsMade = callsMade + 1 # or however many calls you have made
}
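For the original question about pacing 40000 profile lookups against a limit of 180 calls, one possible sketch is to look users up in chunks and pause for a full window whenever the budget is spent (untested; it assumes each lookupUsers() call counts as one request and that chunk_size IDs can be fetched per call, so check the current API limits before relying on these numbers):
chunk_size <- 100
limit      <- 180                          # calls allowed per rate-limit window
chunks     <- split(followers, ceiling(seq_along(followers) / chunk_size))
profiles   <- list()
calls_made <- 0
for (k in seq_along(chunks)) {
  if (calls_made >= limit) {
    message("Rate limit reached; sleeping for 15 minutes")
    Sys.sleep(15 * 60)
    calls_made <- 0
  }
  profiles[[k]] <- twListToDF(lookupUsers(chunks[[k]]))
  calls_made   <- calls_made + 1
}
followers_full <- do.call(rbind, profiles)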

Incorporating a stop function in a random walk

In my previous question, How do I put arena limits on a random walk?, the community helped create a random walk function in a set arena. This function is designed to simulate a fish moving through an area, but now I need to make it decide to stop when a certain condition is satisfied.
I thought it would be as simple as putting if(z>P) break just before the loop function. What I want it to understand is: "if this condition is satisfied then stop, otherwise keep going until you reach the maximum number of steps."
Instead it caused my random walk to become deterministic (I always get the same path and it never stops before step.max).
Main question: How do I tell the random walk to stop if z > P?
For reference:
step.max <- 125
step.prob <- function(n.times = step.max){
  draw = sample(0:100, 1, replace = T)
  CS <- sample(draw, size = 1, replace = TRUE)
  CS.max <- 100
  step.num <- 15
  SP <- ((CS/CS.max)*(1-(step.num/step.max))+(step.num/step.max))*100
  if(SP > P){stop('Settled at step number', P)} else {SP}
}
z<-step.prob(1) #renaming the above function to be easier to reference later
P<-80 #preset cutoff point for value z, ranges from 0-100
walkE <- function(n.times = 125,
                  xlim = c(524058, 542800),
                  ylim = c(2799758, 2818500),
                  start = c(525000, 2810000),
                  stepsize = c(4000, 4000)) {
  plot(c(0, 0), type = "n", xlim = xlim, ylim = ylim,
       xlab = "Easting", ylab = "Northing")
  x <- start[1]
  y <- start[2]
  steps <- 1/c(1, 2, 4, 8, 12, 16)
  steps.y <- c(steps, -steps, 0)
  steps.x <- c(steps, -steps[c(1, 5, 6)], 0)
  points(x, y, pch = 16, col = "red", cex = 1)
  for (i in 1:n.times) {
    repeat {
      xi <- stepsize[1]*sample(steps.x, 1)
      yi <- stepsize[2]*sample(steps.y, 1)
      newx <- x + xi
      newy <- y + yi
      if (newx > xlim[1] && newx < xlim[2] &&
          newy > ylim[1] && newy < ylim[2]) break
    }
    lines(c(x, newx), c(y, newy), col = "blue")
    x <- newx
    y <- newy
    if (z > P) {
      stop(points(newx, newy, col = "green", cex = 1))   # this is where I want it to stop if z > P
    } else if (z < P) {
      points(newx, newy, pch = 1, col = "blue", cex = 1)
    } else if (step.max) {
      points(newx, newy, pch = 16, col = "green", cex = 1)
    }
    set.seed(101)
  }
}
walkE(step.max) # run the random walk function walkE above for step.max steps
Thanks in advance!!!
This is pretty easy and can be accomplished by inserting a stop(...) function in your user defined step.prob function.
step.prob <- function(n.times = step.max, p){
  draw = sample(0:100, 1, replace = T)
  CS <- sample(draw, size = 1, replace = TRUE)
  CS.max <- 100
  CS.max
  step.num <- 15
  SP <- ((CS/CS.max)*(1-(step.num/step.max))+(step.num/step.max))*100
  if(SP > p) {
    stop('Your random walk exceeded ', p)
  } else {
    SP
  }
}
If this doesn't do it for you look into the break command.
So, when the random walk value is > p:
step.prob(p = 3)
# Error in step.prob(p = 3) : Your random walk exceeded 3
And if you want to set the value returned by the function to p you can just add in SP <- p before the stop command.
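As an aside on the main question, one likely reason the walk looked deterministic is the set.seed(101) call inside the for loop, which reseeds the generator to the same state on every step; seeding once before the walk keeps it random. With that change, a break-based version might look roughly like this stripped-down sketch (it assumes a step.prob() that simply returns SP instead of calling stop(), and the plotting of each step stays as in walkE):
set.seed(101)                  # seed once, outside the loop, so the walk stays random
P <- 80
for (i in 1:step.max) {
  z <- step.prob(1)            # draw a fresh settlement probability on every step
  # ... compute newx/newy, draw the segment, and plot the point as in walkE ...
  if (z > P) {
    message("Settled at step ", i)
    break                      # leave the walk early instead of raising an error
  }
}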
