Some data:
a <- function(a) {3*a + 12*a^2}
dom1 <- seq(-1, 4, 0.1)
vec1 <- a(dom1)[1:10]
c <- function(c) {-5*c^2 + 2*c^3}   # note: this masks base c() in the global environment
dom2 <- seq(-0.1, 0.2, 0.01)
vec2 <- c(dom2)[1:10]
d <- function(d) {2*d^2 + 5*d^3 + 12*d^4}
dom3 <- seq(0.1, 0.5, 0.01)
vec3 <- d(dom3)[1:10]
w <- function(w) {7*w - 3*w^2}
dom4 <- seq(0.5, 2.5, 0.05)
vec4 <- w(dom4)[1:10]
Now suppose we fit an lm model on the larger data set, lm(y ~ a + c + d + w), and the fitted coefficients are c(-0.2, 0.2, 0.1, 0.6):
fun.mean <- function(a,c,d,w) {-0.2*a+0.2*c+0.1*d+0.6*w}
What I tried, but it doesn't work as expected: a loop through the vectors generated from the relevant functions (and domains):
for (a in vec1) {
  for (c in vec2) {
    for (d in vec3) {
      for (w in vec4) {
        sol <- fun.mean(a, c, d, w)
        if (sol %% 1 > 0.40 & sol < 0.50) print(c(a, c, d, w))
      }}}}
What I'm looking for are combinations c(a,c,d,w) for which fun.mean evaluates to 0.5, or ideally falls in the interval 0.4-0.5.
But plugging the printed c(a,c,d,w) values back into fun.mean does not give values in that interval. What am I doing wrong? Is there a better approach to find the c(a,c,d,w) values given a "target" value? And what would be a faster alternative to the nested for loops, which are very slow?
You filter using %% 1, but you don't use it consistently when you check the answer.
Your code prints, among others, the combination
9.00 -0.052000 0.02620000 2.7500
for which fun.mean gives -0.15778. That matches your criteria: -0.15778 %% 1 is about 0.84, which is > 0.4, and -0.15778 < 0.5.
So your criteria are wrong; you should probably use either
sol %% 1 > 0.4 & sol %% 1 < 0.5
or
sol > 0.4 & sol < 0.5
In the first case, you should probably also apply the %% 1 inside the fun.mean calculation.
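Since you also ask for a faster alternative to the nested loops: fun.mean is vectorized, so you can build all combinations once and filter in a single shot. A sketch of that approach (my suggestion, not part of the original answer):
grid <- expand.grid(a = vec1, c = vec2, d = vec3, w = vec4)   # all 10^4 combinations
sol <- with(grid, fun.mean(a, c, d, w))                       # evaluate on whole columns at once
grid[sol > 0.4 & sol < 0.5, ]                                 # combinations landing in (0.4, 0.5)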
I want to limit the two bounds of a vector in an if condition. However, I get the warning "the condition has length > 1 and only the first element will be used" when I try to run the following code:
rho <- c(0.8, 0, -0.5)
sigma.S <- 0.4
sigma.M <- 0.1
mu.S <- 0.06
T <- 1
N <- 365
dt <- T/N
m <- c(100, 102, 100, 99, 101)
z <- rnorm(N)
P <- matrix(0, N, 1)
P[1] <- m[1]
for (i in 2:N) {
  P[i] <- P[i-1]*(1 + sigma.M*sqrt(dt)*z[i])
}
tPts <- c(0, 91, 182, 273, 364)
yPts <- c(m[1]-P[1], m[2]-P[92], m[3]-P[183], m[4]-P[274], m[5]-P[365])
a <- tPts[1]
A <- yPts[1]
for (i in 2:5) {
  t <- seq(0, 364, 1)
  b <- tPts[i]
  B <- yPts[i]
  if (a <= t & t <= b) {
    y <- (B*(t-a) - A*(t-b))/(b-a)
    return(y)
  }
  a <- b
  A <- B
}
Can anyone see what the problem is here? Thanks in advance!
We could change the if condition inside the loop by wrapping it with all:
if (all(a <= t) & all(t <= b))
assuming that the condition needs to hold along the whole length of 't'. The reason for the warning is that a <= t (or t <= b) returns a logical vector of the same length as 't', which here is a sequence from 0 to 364. Even when one operand ('a' or 'b') has length 1, the comparison operator recycles it to compare against the longer vector:
5 < (1:6)
#[1] FALSE FALSE FALSE FALSE FALSE  TRUE
whereas if/else requires its condition to have length 1.
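If what the loop is ultimately after is piecewise linear interpolation between the points (tPts, yPts), base R's approx() does that in one vectorized call, with no if needed at all. A sketch under that assumption (my addition, not part of the original answer):
t <- 0:364
y <- approx(x = tPts, y = yPts, xout = t)$y   # yPts linearly interpolated at every time in t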
I'm a beginner R programmer. I'm having trouble with a loop calculation that depends on the previous value, like a recursion.
An example of my data:
library(data.table)
dt <- data.table(a = c(0:4), b = c(0, 1, 2, 1, 3))
The calculated value y is defined by y[n] = (y[n-1] + b[n]) * a[n], with initial value y[1] = 0.
I used a for loop; the code and result are below.
dt$y <- 0
for (i in 2:nrow(dt)) {
  dt$y[i] <- (dt$y[i - 1] + dt$b[i]) * dt$a[i]
}
a b y
1: 0 0 0
2: 1 1 1
3: 2 2 6
4: 3 1 21
5: 4 3 96
This result is what I want. However, my data has over 1,000,000 rows and several columns, so I'm trying to find another way that avoids the for loop. I tried to use Reduce(), but it only seems to work with a single vector (e.g. y[n] = y[n-1] + b[n]); as shown above, my function uses two vectors, a and b, so I couldn't find a solution.
Is there a more efficient, faster way that avoids the for loop, such as a recursive function or a good package function?
This kind of computation cannot make use of R's advantage of vectorization because of the iterative dependencies. But the slow-down appears to really be coming from indexing performance on a data.frame or data.table.
Interestingly, I was able to speed up the loop considerably by accessing a, b, and y directly as numeric vectors (1000+ fold advantage for 2*10^5 rows) or as matrix "columns" (100+ fold advantage for 2*10^5 rows) versus as columns in a data.table or data.frame.
This old discussion may still shed some light on this rather surprising result: https://stat.ethz.ch/pipermail/r-help/2011-July/282666.html
Please note that I also made a different toy data.frame, so that I could test a larger example without y growing to Inf with increasing i:
Option data.frame (numeric vectors embedded in a data.frame or data.table per your example):
vec_length <- 200000
dt <- data.frame(a=seq(from=0, to=1, length.out = vec_length), b=seq(from=0, to=-1, length.out = vec_length), y=0)
system.time(for (i in 2:nrow(dt)) {
dt$y[i] <- (dt$y[i - 1] + dt$b[i]) * dt$a[i]
})
#user system elapsed
#79.39 146.30 225.78
#NOTE: Sorry, I didn't have the patience to let the data.table version finish for vec_length=2*10^5.
tail(dt$y)
#[1] -554.1953 -555.1842 -556.1758 -557.1702 -558.1674 -559.1674
Option vector (numeric vectors extracted in advance of loop):
vec_length <- 200000
dt <- data.frame(a=seq(from=0, to=1, length.out = vec_length), b=seq(from=0, to=-1, length.out = vec_length), y=0)
y <- as.numeric(dt$y)
a <- as.numeric(dt$a)
b <- as.numeric(dt$b)
system.time(for (i in 2:length(y)) {
y[i] <- (y[i - 1] + b[i]) * a[i]
})
#user system elapsed
#0.03 0.00 0.03
tail(y)
#[1] -554.1953 -555.1842 -556.1758 -557.1702 -558.1674 -559.1674
Option matrix (data.frame converted to matrix before loop):
vec_length <- 200000
dt <- as.matrix(data.frame(a=seq(from=0, to=1, length.out = vec_length), b=seq(from=0, to=-1, length.out = vec_length), y=0))
system.time(for (i in 2:nrow(dt)) {
dt[i, 3] <- (dt[i - 1, 3] + dt[i, 2]) * dt[i, 1]
})
#user system elapsed
#0.67 0.01 0.69
tail(dt[,3])
#[1] -554.1953 -555.1842 -556.1758 -557.1702 -558.1674 -559.1674
#NOTE: a matrix is actually a vector with an additional attribute (its "dim") that says how the "matrix" should be organized into rows and columns
Option data.frame with matrix style indexing:
vec_length <- 200000
dt <- data.frame(a=seq(from=0, to=1, length.out = vec_length), b=seq(from=0, to=-1, length.out = vec_length), y=0)
system.time(for (i in 2:nrow(dt)) {
dt[i, 3] <- (dt[(i - 1), 3] + dt[i, 2]) * dt[i, 1]
})
#user system elapsed
#110.69 0.03 112.01
tail(dt[,3])
#[1] -554.1953 -555.1842 -556.1758 -557.1702 -558.1674 -559.1674
An option is to use Rcpp, since this recursive equation is easy to code in C++:
library(Rcpp)
cppFunction("
  NumericVector func(NumericVector b, NumericVector a) {
    int len = b.size();
    NumericVector y(len);   // initialized to 0, matching y[1] = 0 in R
    for (int i = 1; i < len; i++) {
      y[i] = (y[i-1] + b[i]) * a[i];
    }
    return(y);
  }
")
func(c( 0, 1, 2, 1, 3), c(0:4))
#[1] 0 1 6 21 96
timing code:
vec_length <- 1e7
dt <- data.frame(a=1:vec_length, b=1:vec_length, y=0)
y <- as.numeric(dt$y)
a <- as.numeric(dt$a)
b <- as.numeric(dt$b)
system.time(for (i in 2:length(y)) {
y[i] <- (y[i - 1] + b[i]) * a[i]
})
# user system elapsed
# 19.22 0.06 19.44
system.time(func(b, a))
# user system elapsed
# 0.09 0.02 0.09
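To plug the result back into the original data, a short usage sketch (note the argument order defined above: b first, then a):
dt$y <- func(dt$b, dt$a)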
Here is a base R solution. According to the information from @ThetaFC, the indicated speedup comes from using a matrix or plain vectors (rather than a data.frame or data.table). Thus, it is better to do the following preprocessing before calculating df$y, i.e.,
a <- as.numeric(df$a)
b <- as.numeric(df$b)
Then you have two approaches to get df$y. The first is writing a customized recursive function:
f <- function(k) {
  if (k == 1) return(0)
  prev <- f(k - 1)   # compute the recursive call once to avoid exponential re-evaluation
  c(prev, (tail(prev, 1) + b[k]) * a[k])
}
df$y <- f(nrow(df))
Or a non-recursive function (I guess this will be much faster than the recursive approach):
g <- Vectorize(function(k) sum(rev(cumprod(rev(a[2:k])))*b[2:k]))
df$y <- g(seq(nrow(df)))
such that
> df
a b y
1 0 0 0
2 1 1 1
3 2 2 6
4 3 1 21
5 4 3 96
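As a side note on the question's remark that Reduce() only works with a single vector: you can fold over the row indices instead, so that each step can read both a[i] and b[i]. A sketch (my addition, using the toy data from the question):
a <- c(0, 1, 2, 3, 4)
b <- c(0, 1, 2, 1, 3)
# fold over indices 2..n; 'prev' carries y[i-1] and init = 0 supplies y[1]
Reduce(function(prev, i) (prev + b[i]) * a[i],
       seq_along(a)[-1], init = 0, accumulate = TRUE)
# [1]  0  1  6 21 96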
I don't think this will be any faster, but here's one way to do it without an explicit loop
dt[, y := purrr::accumulate2(a, b, function(last, a, b) (last + b) * a,
                             .init = 0)[-1]]
dt
# a b y
# 1: 0 0 0
# 2: 1 1 1
# 3: 2 2 6
# 4: 3 1 21
# 5: 4 3 96
I have a raster stack with 364 layers containing a daily rate of change in NDVI values.
I want to scale these values in every cell: positive values from 0 to 1 and negative values from -1 to 0. So far I have only found a solution that scales values in single layers (see here: Replace specific value in each band of raster brick in R), not along the cells of a multilayer object. Additionally, I have a fair number of cells that are NA for the entire time series, and I'm not quite sure how to deal with that either.
I took the code from the previously mentioned post and tried to get it working for my problem:
norm <- function(x) {-1 + (x - min) * ((1 - (-1)) / (max - min))}
for (j in 1:ncell(tif)) {
  if (is.na(sum(tif[j]))) {
    NULL
  } else {
    cat(paste("Currently processing layer:", j, "/", ncell(tif), "\n"))
    min <- cellStats(tif[j], 'min')
    max <- cellStats(tif[j], 'max')
    # initialize cluster
    # number of cores to use for clusterR function (max recommended: ncores - 1)
    beginCluster(31)
    # normalize
    tif[j] <- clusterR(tif[j], calc, args = list(fun = norm), export = c('min', 'max'))
    # end cluster
    endCluster()
  }
}
I'm not quite certain if this produces the desired output. Any help is very much appreciated!
Some example data
library(raster)
r <- raster(ncol=10, nrow=10)
s <- stack(lapply(1:5, function(i) setValues(r, runif(100, -1, 1))))
# adding NAs
s[[2]][sample(100, 25, TRUE)] <- NA
For scaling (or any other operation) by cell (as requested) you can use calc together with a function that works on a vector. For example:
ff <- function(i) {
  p <- which(i >= 0)
  n <- which(i <= 0)
  # positive values
  if (length(p) > 0) {
    i[p] <- i[p] - min(i[p], na.rm=TRUE)
    i[p] <- i[p] / max(i[p])
  }
  # negative values
  if (length(n) > 0) {
    i[n] <- i[n] - max(i[n], na.rm=TRUE)
    i[n] <- i[n] / abs(min(i[n]))
  }
  i
}
Test it
ff(c(-.3, -.1, .1, .4, .8))
#[1] -1.0000000 0.0000000 0.0000000 0.4285714 1.0000000
ff(c(-.3, -.1, .1, .4, .8, NA))
#[1] -1.0000000 0.0000000 0.0000000 0.4285714 1.0000000 NA
ff(c(-2,-1))
#[1] -1 0
ff(c(NA, NA))
#[1] NA NA
And use it
z <- calc(s, ff)
See below for scaling by layer instead, based on the min and max of all cell values in that layer (I first thought that this was what was asked for). Note that the functions below scale values to between -1 and 1, but do not map the lowest positive value and the highest negative value to zero.
minv <- abs(cellStats(s,'min'))
maxv <- cellStats(s,'max')
f1 <- function(i, mn, mx) {
  j <- i < 0
  j[is.na(j)] <- TRUE
  i[j] <- i[j] / abs(mn)
  i[!j] <- i[!j] / mx
  i
}
ss <- list()
for (i in 1:nlayers(s)) {
  ss[[i]] <- calc(s[[i]], fun = function(x) f1(x, minv[i], maxv[i]))
}
ss1 <- stack(ss)
Or without a loop
f2 <- function(x, mn, mx) {
  x <- t(x)
  i <- which(x > 0)
  i[is.na(i)] <- FALSE
  mxx <- x / mx
  x <- x / mn
  x[i] <- mxx[i]
  t(x)
}
ss2 <- calc(s, fun=function(x) f2(x, minv, maxv))
For reference, to simply scale between 0 and 1
mnv <- cellStats(s,'min')
mxv <- cellStats(s,'max')
x <- (s - mnv) / (mxv - mnv)
To get values between -1 and 1 you can then do
y <- 2 * x - 1
But that way previously negative values can become positive and vice versa.
See ?raster::scale for other types of scaling.
I am trying to generate n random numbers whose sum is less than 1.
So I can't just run runif(n). But I can condition each iteration on the sum of all values generated up to that point.
The idea is to start with an empty vector v and set up a loop such that at each iteration i a runif(1) value is generated; before it is accepted as an element of v, i.e. v[i] <- runif(1), the test sum(v) < 1 is carried out. If it passes, v[i] is accepted; if not (i.e. the sum would exceed 1), v[i] is tossed out and iteration i is repeated.
I am far from implementing this idea, but I would like to resolve it along the lines of what follows. It's not so much a practical problem as an exercise to understand the syntax of loops in general:
n <- 4
v <- 0
for (i in 1:n) {
  rdom <- runif(1)
  if ((sum(v) + rdom) < 1) v[i] <- rdom
  # keep trying before moving on to iteration i + 1???? i stays i?????
}
I have looked into while (I even put it in the title), but I need the vector to end up with n elements. If I simply tell R to keep adding random uniform realizations to v while sum(v) < 1, I can end up with fewer than n elements.
Here's a possible solution. It originally used the more generic repeat rather than while; I edited it to use while and save a couple of lines.
set.seed(0)
n <- 4
v <- numeric(n)
i <- 0
while (i < n) {
  ith <- runif(1)
  if (sum(c(v, ith)) < 1) {
    i <- i + 1
    v[i] <- ith
  }
}
v
# [1] 0.89669720 0.06178627 0.01339033 0.02333120
Using a repeat block, you must check the exit condition yourself, but otherwise it looks very similar:
set.seed(0)
n <- 4
v <- numeric(n)
i <- 0
repeat {
  ith <- runif(1)
  if (sum(c(v, ith)) < 1) {
    i <- i + 1
    v[i] <- ith
  }
  if (i == n) break
}
If you really want to keep exactly the same procedure that you have posted (aka iteratively sample the n values one at a time from the standard uniform distribution, rejecting any samples that cause your sum to exceed 1), then the following code is mathematically equivalent, shorter, and more efficient:
samp <- function(n) {
  v <- rep(0, n)
  for (i in 1:n) {
    v[i] <- runif(1, 0, 1 - sum(v))
  }
  v
}
Basically, this code uses the mathematical fact that if the sum of the vector is currently sum(v), then sampling from the standard uniform distribution until you get a value no greater than 1-sum(v) is exactly equivalent to sampling in the uniform distribution from 0 to 1-sum(v). The advantage of using the latter approach is that it's much more efficient -- we don't need to keep rejecting samples and trying again, and can instead just sample once for each element.
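A quick empirical check of that equivalence (my addition, not from the original answer), with 0.3 playing the role of 1 - sum(v):
set.seed(1)
# rejection sampling: redraw until the value fits in the remaining budget
rejection <- replicate(1e4, { x <- runif(1); while (x > 0.3) x <- runif(1); x })
# direct sampling from the scaled uniform
direct <- runif(1e4, 0, 0.3)
ks.test(rejection, direct)   # large p-value: the two distributions agree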
To get a sense of the runtime differences, consider sampling 100 observations with n=10, comparing to a working implementation of the code from your post (copied from my other answer to this question):
OP <- function(n) {
  v <- rep(0, n)
  for (i in 1:n) {
    rdom <- runif(1)
    while (sum(v) + rdom > 1) rdom <- runif(1)
    v[i] <- rdom
  }
  v
}
set.seed(144)
system.time(samples.OP <- replicate(100, OP(10)))
# user system elapsed
# 261.937 1.641 265.805
system.time(samples.josliber <- replicate(100, samp(10)))
# user system elapsed
# 0.004 0.001 0.004
In this case, the new approach is roughly 65,000 times faster.
It sounds like you're trying to uniformly sample from a space of n variables where the following constraints hold:
x_1 + x_2 + ... + x_n <= 1
x_1 >= 0
x_2 >= 0
...
x_n >= 0
The "hit and run" algorithm is the mathematical machinery that enables you to do exactly this. In 2-dimensional space, the algorithm will sample uniformly from the following triangle, with each location in the shaded area being equally likely to be selected:
The algorithm is provided in R through the hitandrun package, which requires you to specify the linear inequalities that define the space through a constraint matrix, direction vector, and right-hand side vector:
library(hitandrun)
n <- 3
constr <- list(constr = rbind(rep(1, n), -diag(n)),
               dir = rep("<=", n+1),
               rhs = c(1, rep(0, n)))
set.seed(144)
samples <- hitandrun(constr, n.samples=1000)
head(samples, 10)
# [,1] [,2] [,3]
# [1,] 0.28914690 0.01620488 0.42663224
# [2,] 0.65489979 0.28455231 0.00199671
# [3,] 0.23215115 0.00661661 0.63597912
# [4,] 0.29644234 0.06398131 0.60707269
# [5,] 0.58335047 0.13891392 0.06151205
# [6,] 0.09442808 0.30287832 0.55118290
# [7,] 0.51462261 0.44094683 0.02641638
# [8,] 0.38847794 0.15501252 0.31572793
# [9,] 0.52155055 0.09921046 0.13304728
# [10,] 0.70503030 0.03770875 0.14299089
Breaking down this code a bit, we generated the following constraint matrix:
constr
# $constr
# [,1] [,2] [,3]
# [1,] 1 1 1
# [2,] -1 0 0
# [3,] 0 -1 0
# [4,] 0 0 -1
#
# $dir
# [1] "<=" "<=" "<=" "<="
#
# $rhs
# [1] 1 0 0 0
Reading across the first line of constr$constr we have 1, 1, 1 which indicates "1*x1 + 1*x2 + 1*x3". The first element of constr$dir is <=, and the first element of constr$rhs is 1; putting it together we have x1 + x2 + x3 <= 1. From the second row of constr$constr we read -1, 0, 0 which indicates "-1*x1 + 0*x2 + 0*x3". The second element of constr$dir is <= and the second element of constr$rhs is 0; putting it together we have -x1 <= 0 which is the same as saying x1 >= 0. The similar non-negativity constraints follow in the remaining rows.
Note that the hit and run algorithm has the nice property of having the exact same distribution for each of the variables:
hist(samples[,1])
hist(samples[,2])
hist(samples[,3])
Meanwhile, the distribution of the samples from your procedure will be highly uneven, and as n increases this problem will get worse and worse.
OP <- function(n) {
  v <- rep(0, n)
  for (i in 1:n) {
    rdom <- runif(1)
    while (sum(v) + rdom > 1) rdom <- runif(1)
    v[i] <- rdom
  }
  v
}
samples.OP <- t(replicate(1000, OP(3)))
hist(samples.OP[,1])
hist(samples.OP[,2])
hist(samples.OP[,3])
An added advantage is that the hit-and-run algorithm appears faster -- I generated these 1000 replicates in 0.006 seconds on my computer with hit-and-run and it took 0.3 seconds using the modified code from the OP.
Here's how I would do it, without any loop, if or while:
set.seed(123)
x <- runif(1) # start with the sum that you want to obtain
n <- 4 # number of generated random numbers, can be chosen arbitrarily
y <- sort(runif(n-1,0,x)) # choose n-1 random points to cut the range [0:x]
z <- c(y[1],diff(y),x-y[n-1]) # result: determine the length of the segments
#> z
#[1] 0.11761257 0.10908627 0.02723712 0.03364156
#> sum(z)
#[1] 0.2875775
#> all.equal(sum(z),x)
#[1] TRUE
The advantage here is that you can determine exactly which sum you want to obtain and how many numbers n you want to generate for this. If you set, e.g., x <- 1 in the second line, the n random numbers stored in the vector z will add up to one.
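For instance, to make the n numbers sum to exactly 1 (the x <- 1 case just mentioned):
set.seed(123)
n <- 4
x <- 1
y <- sort(runif(n - 1, 0, x))
z <- c(y[1], diff(y), x - y[n - 1])
sum(z)
#[1] 1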
Say I have a vector v = c(250,1200,700), a starting value n and a function e.g.
library(memisc)   # provides cases()
f <- function(v) {
  g <- function(v) {
    cases(
      v <= 20 -> 0.1,
      v > 20 & v <= 100 -> 0.075,
      v > 100 -> 0.05
    )
  }
  suppressWarnings(g(v))
}
f is written using cases from the memisc package - I'm still new to R and would be keen to hear if f can be coded in a 'better' way. Anyway, I am looking for code that will perform the following recursive process (including for vectors of a 'large' length):
f(n),
f(n)*v[1]+n,
f(f(n)*v[1]+n)*v[2] + f(n)*v[1]+n,
f(f(f(n)*v[1]+n)*v[2] + f(n)*v[1]+n)*v[3] + f(f(n)*v[1]+n)*v[2] + f(n)*v[1]+n
Ultimately I am interested in the value of the last line.
Cheers for any help
If I understood you right, this is the process you're talking about:
X1 = f(n)
X2 = X1*v[1] + n
X3 = f(X2)*v[2] + X2
X4 = f(X3)*v[3] + X3
...
If you need all the in-between steps, a recursive function is rather useless, as the in-between steps have to be stored in the result anyway. So you can easily code this using basic R:
Thefun <- function(v, n) {
  l <- length(v)
  res <- numeric(l + 1)
  res[1] <- g(n)   # assumes your g() is available at top level
  res[2] <- res[1]*v[1] + n
  for (i in seq(2, l)) {
    res[i+1] <- res[i] + g(res[i])*v[i]
  }
  return(res)
}
The last value of the result is the one you need. As you only need the result of the last step, you can also do it recursively using Recall:
Recfunc <- function(v, n) {
  l <- length(v)
  if (l > 0) {
    res <- Recall(v[-l], n)
    return(g(res)*v[l] + res)
  } else {
    return(n)
  }
}
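A usage sketch (my addition): both functions assume g() exists at the top level, so define it there first, either as the cases() version from the question or as the vectorized equivalent shown in the side note below. With g defined and an arbitrary starting value n = 100:
v <- c(250, 1200, 700)
n <- 100
Thefun(v, n)    # all steps:  0.075 118.750 178.750 213.750
Recfunc(v, n)   # final value only: 213.75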
On a side note, you can define your function g differently, like this (I call it fv):
fv <- function(v) {
  0.1*(v <= 20) + 0.075*(v > 20 & v <= 100) + 0.05*(v > 100)
}
Compared to your function, this gives roughly a six-fold speedup:
library(rbenchmark)   # provides benchmark()
vec <- sample(1:150, 1e5, TRUE)
benchmark(
  fv(vec),
  g(vec),
  columns = c("test", "replications", "elapsed", "relative"),
  replications = 1000
)
test replications elapsed relative
1 fv(vec) 1000 9.39 1.000
2 g(vec) 1000 56.30 5.996
I assume here that n is the length of v. I rewrite the recursion like this:
y1 <- n   ## slight change here
y2 <- f(y1)*v[1] + y1
y3 <- f(y2)*v[2] + y2
y4 <- f(y3)*v[3] + y3
... (I can't see terms beyond length(v), hence my assumption above.)
So, for example, you can implement this like:
filter.f <- function(func = f, coef = v) {
  n <- length(coef)
  y <- numeric(n)
  y[1] <- n
  for (i in 2:n)
    y[i] <- func(y[i-1])*coef[i-1] + y[i-1]   ## here the recursion
  y[1] <- f(n)
  y
}
filter.f()
[1] 0.1 124.0 159.0 191.5
v=c(250, 1200, 700)
filter.f()
[1] 0.1 28.0 118.0