I'm trying to improve the speed of my code, which optimises a value over 3 variables with large ranges. The most likely output uses values in the middle of the ranges, so the code wastes time starting from the lowest possible value of each variable; I want to start from the middle value and iterate outwards. The actual problem has thousands of lines with numbers from 150-650. The C, H and O limits will be defined somewhat based on the starting number, but the answer will always be more likely near a central value in the defined range.
Is there a way to define the for loop to work outwards like this? The only, quite shabby, way I can think of is to redefine the loop value from a lookup vector inside the loop (e.g. 1=20, 2=21, 3=19, etc.). See my current code below:
set_error <- 2.5
ct <- c(325.00214, 325.00952, 325.02004, 325.02762, 325.03535, 325.03831, 325.04588,
        325.05641, 325.06402, 325.06766, 325.07167, 325.07454, 325.10396)
FormFun <- function(x){
  for (C in 1:40) {
    for (H in 1:80) {
      for (O in 1:40) {
        test_mass <- C*12 + H*1.007825 + O*15.9949146 - 1.0072765
        error <- 1000000*abs(test_mass - x)/x
        if (error < set_error) {
          result <- paste("C", C, "H", H, "O", O, sep = "")
          return(result)
        }
      }
    }
  }
}
old_t <- Sys.time()
ct2<-lapply(ct,FormFun)
new_t <- Sys.time() - old_t # calculate difference
print(new_t)
Use vectorization and create a closure:
FormFun1_fac <- function(gr) {
  force(gr)  # capture the grid in the closure's environment
  function(x, set_error){
    test_mass <- with(gr, C*12 + H*1.007825 + O*15.9949146 - 1.0072765)
    error <- 1000000 * abs(test_mass - x) / x
    ind <- which(error < set_error)[1]
    if (is.na(ind)) return(NULL)
    paste0("C", gr[ind, "C"], "H", gr[ind, "H"], "O", gr[ind, "O"])
  }
}
FormFun1 <- FormFun1_fac(expand.grid(C = 1:40, H = 1:80, O = 1:40))
ct21 <- lapply(ct, FormFun1, set_error = set_error)
all.equal(ct2, ct21)
#[1] TRUE
This saves a grid of all combinations of C, H, O in the function environment and calculates the error for all combinations (which is fast in vectorized code). The first combination that passes the test is returned.
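If you still want the search to favour the middle of each range, as asked in the question, one option is to sort the grid by distance from the centre before building the closure, so that the first combination passing the test is also the most central one. A minimal sketch, assuming centre values of 20, 40 and 20 (with real data you would derive these from the starting mass); note that the first match can now differ from the loop version, since the scan order has changed:
gr_mid <- expand.grid(C = 1:40, H = 1:80, O = 1:40)
# order rows by total distance from the assumed centre of each range
gr_mid <- gr_mid[order(abs(gr_mid$C - 20) + abs(gr_mid$H - 40) + abs(gr_mid$O - 20)), ]
FormFun1_mid <- FormFun1_fac(gr_mid)
ct22 <- lapply(ct, FormFun1_mid, set_error = set_error)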
Consider the function f, which takes decision-tree node parameters in {-1,+1} and maps them to a one-hot vector, for example [0,0,0,1].
I think this will end up being one of the bottlenecks of a program I'm working on, so I'd like to know if anyone can find a faster way to map the parameters to the vector.
f <- function(h){
  # function takes as arguments:
  #   an m-bit vector of potential split decisions (h)
  # function returns:
  #   an m+1-length one-hot indicator vector
  theta_vec = c(rep(0, length(h) + 1))
  position = length(h) + 1
  for (bit in seq(1, length(h), 2)) {
    if (h[bit] > 0) {
      position = position
    } else {
      position = position/2
    }
  }
  theta_vec[position] = 1
  return(theta_vec)
}
Thank you for your help
I think I've got a solution that runs in a quarter of the time. Are you able to refactor so that you use (0,1) instead of (-1,1), and pass the tree as a list of rows instead of a single vector? I find it easier to interpret when thinking about the problem, although the function below could be rewritten to take a vector as input.
findPos <- function(h){
  # find number of rows (tree levels) from the input
  N <- length(h)
  # go through and pick out the values in each tree row that are valid based
  # on the previous route
  out <- c(h[[1]], rep(0, N-1))
  for (i in 2:N) {
    out[i] <- h[[i]][sum(out[1:(i-1)] * 2^(i-1)/(2^((i-1):1))) + 1]
  }
  # now find the final position in the bottom row and return as a vector
  out_pos <- sum(out * 2^N/(2^(1:N))) + 1
  full_vec <- rep(0, 2^N)
  full_vec[out_pos] <- 1
  return(full_vec)
}
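If it helps to keep the old calling convention, here is a sketch of a hypothetical helper, as_rows, that converts an m-bit {-1,+1} vector into the 0/1 list-of-rows format used above; it assumes the rows have lengths 1, 2, 4, ... down the tree, matching the row structure in the examples below.
# hypothetical helper: map a {-1,+1} split vector to the list-of-rows
# 0/1 format expected by findPos (assumes row lengths 1, 2, 4, ...)
as_rows <- function(h_pm) {
  h01 <- (h_pm + 1) / 2                    # -1 -> 0, +1 -> 1
  depth <- log2(length(h_pm) + 1)          # number of tree levels
  split(h01, rep(seq_len(depth), 2^(seq_len(depth) - 1)))
}
as_rows(c(-1, 1, 1))   # gives rows 0 and c(1, 1)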
# couple of e.gs
f(c(0,1,1))
findPos(list(0, c(1,1)))
f(c(1,1,1))
findPos(list(1, c(1,1)))
# works with larger trees
findPos(list(1, c(1,1), c(1,0,0,0)))
# check time using microbenchmark package
microbenchmark::microbenchmark(
"old" = {
f(c(0,1,1))
},
"new" = {
findPos(list(0, c(1,1)))
}
)
Best
Jonny
I am currently working on a script that loads a TIF file into a raster object, crops it, and plots two points (a starting point and a point of destination, selected via the click function) into that raster. It then gets the cell numbers of those two points. All of that works fine, but now I have tried to write a while loop that picks the number of a random cell adjacent to the current cell (beginning from the starting point) until that cell number equals the cell number of my point of destination. The idea is to "walk" across the raster until I reach my point of destination, or at least the column containing it (to reduce computation time). The numbers of the cells I cross during that walk should be stored in a vector ("Path").
I select the adjacent cell (i.e. choose my next step) by randomly sampling from a vector of offsets that, when added to the current cell number, give the number of an adjacent cell. I have several such vectors to sample from because the number of possible directions differs depending on the position of the current cell (e.g. I can't walk to the cell to my lower right, n + (ncol_dispersal + 1), if I am currently positioned at the bottom of the raster). The script looks like this so far:
library(gdistance)
library(raster)
library(rgdal)
library(sp)
setwd("C:/Users/Giaco/Dropbox/Random Walk")
altdata <- raster("altitude.tif")
plot(altdata)
e <- extent(92760.79,93345.79,204017.5,204242.5)
dispersal_area <- crop(altdata,e)
plot(dispersal_area)
points(92790.79,204137.5,pch=16,cex=1)
points(93300.79,204062.5,pch=16,cex=1)
Pts <- matrix(c(92790.79,204137.5,93300.79,204062.5),nrow=2,ncol=2,byrow=TRUE)
Start <- cellFromXY(dispersal_area,Pts[1,])
End <- cellFromXY(dispersal_area,Pts[2,])
nrow_dispersal <- nrow(dispersal_area)
ncol_dispersal <- ncol(dispersal_area)
col_start <- colFromCell(dispersal_area,Start)
row_start <- rowFromCell(dispersal_area,Start)
col_end <- colFromCell(dispersal_area,End)
row_end <- rowFromCell(dispersal_area,End)
upper_left_corner <- cellFromRowCol(dispersal_area,1,1)
lower_left_corner <- cellFromRowCol(dispersal_area,14,1)
sample_standard <- c(1,(ncol_dispersal+1),(ncol_dispersal*-1+1))
sample_top <- c(1,ncol_dispersal,(ncol_dispersal+1))
sample_bottom <- c(1,(ncol_dispersal*-1+1),(ncol_dispersal*-1))
sample_left <- c(1,(ncol_dispersal+1),(ncol_dispersal*-1+1))
sample_upper_left <- c(1,ncol_dispersal,(ncol_dispersal+1))
sample_lower_left <- c(1,(ncol_dispersal*-1+1),(ncol_dispersal*-1))
Path <- c()
Path[1] <- Start
n <- Start
counter <- 1
while (n != End) {
  n = Start + sample(sample_standard, 1)
  if (colFromCell(dispersal_area, n) == col_end) {
    n = End
    break
  } else if (n == upper_left_corner) {
    n = n + sample(sample_upper_left, 1)
  } else if (n == lower_left_corner) {
    n = n + sample(sample_lower_left, 1)
  } else if (colFromCell(dispersal_area, n) == 1) {
    n = n + sample(sample_left, 1)
  } else if (rowFromCell(dispersal_area, n) == 1) {
    n = n + sample(sample_top, 1)
  } else if (rowFromCell(dispersal_area, n) == nrow_dispersal) {
    n = n + sample(sample_bottom, 1)
  }
  counter <- counter + 1
  Path[counter] <- n
}
When I run the script and print the Path vector, it returns a very long vector (I always have to stop it, as it never finishes computing) which contains only a few different numbers. Why is that? I have been staring at this all day but I simply can't figure out where I went wrong. There must be something wrong with the while loop, but I don't see it.
If any of you could help me out with this I would be really thankful.
Thanks in advance!
Here is a simple and reproducible example (that also answers your question).
library(gdistance)
r <- raster(system.file("external/maungawhau.grd", package="gdistance"))
r <- aggregate(r, 5)
p <- matrix(c(2667531, 6478843, 2667731, 6479227), ncol=2, byrow=TRUE)
start <- cellFromXY(r, p[1,])
end <- cellFromXY(r, p[2,])
counter <- 1
cell <- start
path <- cell
while (cell != end) {
  a <- adjacent(r, cell, pairs=FALSE)
  cell <- sample(a, 1)
  path <- c(path, cell)
}
xy <- xyFromCell(r, path)
plot(r)
lines(xy)
or
cols <- rainbow(nrow(xy))
for (i in 1:(nrow(xy)-1)) { lines(xy[i:(i+1), ], col=cols[i]) }
This is pretty fast on this coarse raster, but it could indeed take a very long time to reach a particular cell on a large raster by random walk.
Perhaps there are functions in gdistance that are more useful?
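For example, if a least-cost path between the two points would serve instead of a pure random walk, something along these lines might work, reusing r and p from above (a sketch assuming the transition()/geoCorrection()/shortestPath() workflow from the gdistance vignette, so treat it as a starting point rather than a tested solution):
# least-cost path as an alternative to the pure random walk
tr  <- transition(r, transitionFunction = mean, directions = 8)
trC <- geoCorrection(tr, type = "c")
lcp <- shortestPath(trC, p[1, ], p[2, ], output = "SpatialLines")
plot(r)
lines(lcp)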
I'm attempting to read in a few hundred thousand JSON files and eventually get them into a dplyr object. But the JSON files are not a simple key-value parse and require a lot of pre-processing. The pre-processing is written and performs fairly well. The challenge I am having is loading each record into a single object (a data.table or dplyr object) efficiently.
This is very sparse data: I'll have over 2000 variables that will mostly be missing. Each record will have maybe a hundred variables set. The variables will be a mix of character, logical and numeric, and I do know the mode of each variable.
I thought the best way to avoid R copying the object on every update (or on adding one row at a time) would be to create an empty data frame and then update the specific fields after they are pulled from the JSON file. But doing this in a data frame is extremely slow; moving to a data.table or dplyr object is much better, but I'm still hoping to reduce it to minutes instead of hours. See my example below:
library(data.table)
timeMe <- function() {
  set.seed(1)
  names = paste0("A", seq(1:1200))
  # try with a data frame
  # outdf <- data.frame(matrix(NA, nrow=100, ncol=1200, dimnames=list(NULL, names)))
  # try with data table
  outdf <- data.table(matrix(NA, nrow=100, ncol=1200, dimnames=list(NULL, names)))
  for (i in seq(100)) {
    # generate 100 columns (real data is in json)
    sparse.cols <- sample(1200, 100)
    # Each record is coming in as a list
    # Each column is either a character, logical, or numeric
    sparse.val <- lapply(sparse.cols, function(i) {
      if (i < 401) {         # logical
        sample(c(TRUE, FALSE), 1)
      } else if (i < 801) {  # numeric
        sample(seq(10), 1)
      } else {               # character
        sample(LETTERS, 1)
      }
    }) # now we have a list with values to populate
    names(sparse.val) <- paste0("A", sparse.cols)
    # and here is the challenge and what takes a long time.
    # want to assign the ith row and the named column with each value
    for (x in names(sparse.val)) {
      val = sparse.val[[x]]
      # this is where the bottleneck is.
      # for data frame
      # outdf[i, x] <- val
      # for data table
      outdf[i, x:=val]
    }
  }
  outdf
}
I thought the mode of each column might have been set and reset with each update, but I have also tried this by pre-setting each column type and this didn't help.
For me, running this example with a data.frame (commented out above) takes around 22 seconds, converting to a data.table is 5 seconds. I was hoping someone knew what was going on under the covers and could provide a faster way to populate the data table here.
I follow your code except for the part where you construct sparse.val. There are minor errors in the way you assign columns. Don't forget to check that the answer is right while trying to optimise :).
First, the creation of data.table:
Since you say that you already know the type of the columns, it's important to generate the correct type up front. Else, when you do: DT[, LHS := RHS] and RHS type is not equal to LHS, RHS will be coerced to the type of LHS. In your case, all your numeric and character values will be converted to logical, as all columns are logical type. This is not what you want.
Creating a matrix therefore won't help (all columns will be of the same type), and it's also slow. Instead, I'd do it like this:
rows = 100L
cols = 1200L
outdf <- setDT(lapply(seq_len(cols), function(i) {
  if (i < 401L) rep(NA, rows)
  else if (i >= 402L & i < 801L) rep(NA_real_, rows)
  else rep(NA_character_, rows)
}))
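To see why the per-column construction matters, here is a small check (with throwaway column names, purely for illustration): a matrix forces every column to one type, while building the columns individually keeps the types you intend.
# a matrix collapses everything to one type (logical here, since NA is logical)
m <- data.table(matrix(NA, nrow = 2, ncol = 3))
sapply(m, typeof)   # "logical" "logical" "logical"
# per-column vectors keep each column's intended type
d <- setDT(list(a = rep(NA, 2), b = rep(NA_real_, 2), c = rep(NA_character_, 2)))
sapply(d, typeof)   # "logical"   "double"    "character"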
Now we've the right type set. Next, I think it should be i >= 402L & i < 801L. Otherwise, you're assigning the first 401 columns as logical and then the first 801 columns as numeric, which, given that you know the type of the columns upfront, doesn't make much sense, right?
Second, doing names(.) <-:
The line:
names(sparse.val) <- paste0("A", sparse.cols)
will create a copy and is not really necessary. Therefore we'll delete this line.
Third, the time consuming for-loop:
for (x in names(sparse.val)) {
  val = sparse.val[[x]]
  outdf[i, x:=val]
}
is not actually doing what you think it's doing. It's not assigning the values from val to the name assigned to x. Instead it's (over)writing (each time) to a column named x. Check your output.
This is not part of the optimisation; it's just to make clear what you actually want to do here.
for (x in names(sparse.val)) {
  val = sparse.val[[x]]
  outdf[i, (x) := val]
}
Note the ( around x. Now it'll be evaluated, and the value contained in x will be the column to which val's value is assigned. It's a bit subtle, I understand. But this is necessary because it preserves the possibility of creating a column literally named x with DT[, x := val], for the case where that is what you actually want.
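A tiny illustration of the difference (the column name here is made up just for this example):
DT <- data.table(A5 = 1:3)
x <- "A5"
DT[, x := 0]     # creates/overwrites a column literally named "x"
DT[, (x) := 0]   # assigns to the column named in x, i.e. "A5"
names(DT)        # "A5" "x"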
Coming back to the optimisation, the good news is, your time consuming for-loop is simply:
set(outdf, i=i, j=paste0("A", sparse.cols), value = sparse.val)
This is where data.table's sub-assign by reference feature comes in handy!
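As a quick standalone feel for set() outside the loop (toy table and values, just for illustration):
DT <- data.table(a = rep(NA_real_, 3), b = rep(NA_character_, 3))
set(DT, i = 2L, j = c("a", "b"), value = list(1.5, "x"))
DT   # row 2 now holds 1.5 and "x"; everything else stays NA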
Putting it all together:
Your final function looks like this:
timeMe2 <- function() {
  set.seed(1L)
  rows = 100L
  cols = 1200L
  outdf <- as.data.table(lapply(seq_len(cols), function(i) {
    if (i < 401L) rep(NA, rows)
    else if (i >= 402L & i < 801L) rep(NA_real_, rows)
    else rep(NA_character_, rows)
  }))
  setnames(outdf, paste0("A", seq(1:1200)))
  for (i in seq(100)) {
    sparse.cols <- sample(1200L, 100L)
    sparse.val <- lapply(sparse.cols, function(i) {
      if (i < 401L) sample(c(TRUE, FALSE), 1)
      else if (i >= 402L & i < 801L) sample(seq(10), 1)
      else sample(LETTERS, 1)
    })
    set(outdf, i = i, j = paste0("A", sparse.cols), value = sparse.val)
  }
  outdf
}
By doing this, your solution takes 9.84 seconds on my system whereas the function above takes 0.34 seconds, which is ~29x improvement. I think this is the result you're looking for. Please verify it.
HTH
I wrote some R code to find the minimum of a function using the gradient descent method, shown below:
gradient.method <- function(f, grad, init, unit.fac=TRUE, interval=c(-7,10), tol=1e-11, max.iter = 35)
{
  newpair <- init
  oldpair <- newpair - 1
  iter <- 0
  while (iter < max.iter & sqrt(sum((newpair - oldpair)^2)) > tol) {
    iter <- iter + 1
    oldpair <- newpair
    # Set up the unit vector u
    newstep <- if (unit.fac) grad(x)(oldpair)/sqrt(sum(grad(x)(oldpair)**2))
    # Get minimum of f(x_0 - step_size*grad(x_0))
    value <- function(step_size) oldpair - step_size*newstep
    min <- optimize(f(x)(value(step_size)), interval)
    # Get new pair of vector x
    newpair <- oldpair - min*newstep
  }
  list(minimum = newpair, value = f(x)(newpair), nsteps = iter)
}
The functions for f and grad are as follows:
f1 <- function(x){
n<-length(x)
function(theta){
-logLike<- 0.5*n*log(theta[2])-(1/(2*theta[2]))*sum((x-theta[1])**2)
}
}
g1 <- function(x){
n <- length(x)
function(theta){
grd1 <- -sum((x - theta[1])*theta[2])
grd2 <- n/(theta[2]) - 0.5*sum(x - theta[1])
}
}
However, I keep getting an error regarding one of my variables, step_size, when testing the code. How should I correct the problem? Thanks.
res<-gradient.method(f=f1, grad=g1, init=c(100,100), max.iter=100)
Error in value(step_size) : object 'step_size' not found
The error message is quite clear: you are trying to use the variable step_size, which has not been defined anywhere. The problem stems from the fact that you aren't using the optimize function properly; you should give it your function, which is then minimized with respect to its first argument. From the help page of optimize (use ?optimize):
f
the function to be optimized. The function is either minimized or
maximized over its first argument depending on the value of maximum.
So you should be using optimize like this:
value <- function(step_size) oldpair - step_size*newstep
fn <- function(step_size) f(x)(value(step_size))
min <- optimize(fn,interval)
Also, the variable x is not defined anywhere, and thus your functions f1 and g1 look a bit weird; for example, this is not valid code:
-logLike<- 0.5*n*log(theta[2])-(1/(2*theta[2]))*sum((x-theta[1])**2)
You are trying to assign something to a variable called -logLike, but you cannot use - in a variable name.
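One possible rewrite of f1 keeps the original expression exactly as written but simply returns it instead of trying to assign it to -logLike (whether the signs match the intended negative log-likelihood is for you to check):
f1 <- function(x){
  n <- length(x)
  function(theta){
    # return the value directly instead of assigning it to "-logLike"
    0.5*n*log(theta[2]) - (1/(2*theta[2]))*sum((x - theta[1])**2)
  }
}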
edit:
Check the documentation of optimize on what the function returns:
Value
A list with components minimum (or maximum) and objective which give
the location of the minimum (or maximum) and the value of the function
at that point.
So your variable min contains two elements, although in the next line of your code you probably need just the location of the minimum, i.e. min$minimum.
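Put together, the inner part of the loop would look roughly like this (a sketch only; the undefined x and the structure of f and grad still need sorting out, as noted above):
value <- function(step_size) oldpair - step_size*newstep
fn <- function(step_size) f(x)(value(step_size))
opt <- optimize(fn, interval)
newpair <- oldpair - opt$minimum * newstep   # use the location of the minimum, not the whole list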
I tried to write some functions to calculate ANOVA power and sample size using the non-centrality parameter.
There are some very good functions for this in R, but my functions were written to learn and to reproduce the line of thought from a biostatistics book...
Despite the math involved, my "nc" and "fpower" functions work well, as expected:
nc <- function(diff, n, sd) {
  nonc <- (diff^2/2)*(n/sd^2)
  return(nonc)
}
fpower <- function(k, n, diff, sd, alpha=0.05) {
  nonc <- nc(diff, n, sd)
  dfn <- k - 1
  dfd <- k*(n-1)
  f1 <- qf(1-alpha, dfn, dfd)
  f2 <- pf(f1, dfn, dfd, nonc)
  return(1-f2)
}
However, my "fsample" just doesn´t work as expected. Return 2, the first n in the seq.
fsample <- function(k, diff, sd, alpha=0.05, power=0.9){
  for (n in 2:5000) {
    if (fpower(k, n, sd, alpha) >= power) break
  }
  return(n)
}
But, if I "hand" run this code in console it work as expected!!
And return the right n value.
What's wrong?
You didn't pass the diff argument to fpower, so the arguments aren't in the order you think they are. fsample should be:
fsample <- function(k, diff, sd, alpha=0.05, power=0.9){
  for (n in 2:5000) {
    if (fpower(k, n, diff, sd, alpha) >= power) break
  }
  return(n)
}
Note that this wouldn't have been a problem if you had named the arguments when you called fpower because you would have received an error about diff being missing and not having a default value:
# this will error
fsample <- function(k, diff, sd, alpha=0.05, power=0.9){
  for (n in 2:5000) {
    if (fpower(k=k, n=n, sd=sd, alpha=alpha) >= power) break
  }
  return(n)
}
Also, you might want to avoid giving data objects the same name as functions (e.g. diff, sd, and power are also functions), otherwise you may confuse yourself.
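For what it's worth, R will still find the function when you call it, which is exactly why this kind of shadowing is easy to miss:
sd <- 2             # a data object that shadows stats::sd
sd(c(1, 2, 3))      # still calls the function, so the shadowing goes unnoticed
rm(sd)              # remove the shadowing object again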