I am trying to construct a new variable, z, using two pre-existing variables, x and y. Suppose for simplicity that there are only 5 observations (corresponding to 5 time periods) and that x = c(5,7,9,10,14) and y = c(0,2,1,2,3). I am really only using the first observation in x as the initial value, and then constructing the new variable z from depreciated values of x[1] (depreciation rate of 0.05 per annum) and each of the observations over time in the vector y. The variable I am constructing takes the form of a new 5 by 1 vector, z, and it can be obtained using the following simple commands in R:
z <- NULL
for (i in 1:length(x)) {
  n <- seq(1, i, by = 1)
  z[i] <- sum(c(0.95^(i - 1) * x[1], 0.95^(i - n) * y[n]))
}
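In symbols, this computes z[i] = 0.95^(i-1) * x[1] + sum over n = 1,...,i of 0.95^(i-n) * y[n] for each period i.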
The problem I am having is that I need to define this operation as a function. That is, I need to create a function f that will spit out the vector z whenever any arbitrary vectors x and y are plugged into the function, f(x,y). I’ve been going around in circles for days now and I was wondering if someone would be kind enough to provide me with a suggestion about how to proceed. Thanks in advance.
I hope the following will work for you...
x=c(5,7,9,10,14)
y=c(0,2,1,2,3)
getZ <- function(x, y) {
  z <- NULL
  for (i in 1:length(x)) {
    n <- seq(1, i, by = 1)
    z[i] <- sum(c(0.95^(i - 1) * x[1], 0.95^(i - n) * y[n]))
  }
  return(z)
}
z = getZ(x,y)
z
[1] 5.000000 6.750000 7.412500 9.041875 11.589781
This version allows 0.05 (or any other depreciation rate) to be passed in as r.
ConstructZ <- function(x, y, r) {
  n <- length(y)
  d <- 1 - r  # per-period retention factor
  Z <- vector(length = n)
  for (i in seq_along(x)) {
    idx <- seq_len(i)  # periods 1..i
    Z[i] <- sum(c(d^(i - 1) * x[1], d^(i - idx) * y[idx]))
  }
  return(Z)
}
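For example, with the same x and y as above, passing r = 0.05 should reproduce the earlier result:
ConstructZ(x, y, 0.05)
# [1]  5.000000  6.750000  7.412500  9.041875 11.589781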
Here is a cool (if I say so myself) way to implement this as an infix operator (since you called it an operation).
ff <- function(x, y, i) {
  n <- seq.int(i)
  sum(c(0.95^(i - 1) * x[[1]],
        0.95^(i - n) * y[n]))
}
`%dep%` <- function(x, y) sapply(seq_along(x), ff, x = x, y = y)
x %dep% y
[1] 5.000000 6.750000 7.412500 9.041875 11.589781
Doing the loop multiple times and recalculating the exponents each time may be inefficient. Here's another way to implement your calculation:
getval <- function(x, y, lambda = 0.95) {
  n <- length(y)
  pp <- lambda^(1:n - 1)  # lambda^0, lambda^1, ..., lambda^(n-1)
  yy <- sapply(1:n, function(i) {
    sum(y * c(pp[i:1], rep.int(0, n - i)))
  })
  pp * x[1] + yy
}
Testing with @vrajs5's sample data:
x=c(5,7,9,10,14)
y=c(0,2,1,2,3)
getval(x,y)
# [1] 5.000000 6.750000 7.412500 9.041875 11.589781
but it appears to be about 10x faster when tested on larger data such as:
set.seed(15)
x <- rpois(200,20)
y <- rpois(200,20)
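As a rough sketch of how that comparison could be timed (reusing getZ from the first answer; the exact ratio will vary by machine):
system.time(replicate(100, getZ(x, y)))    # loop-based version
system.time(replicate(100, getval(x, y)))  # vectorized version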
I'm not sure how often you will run this or on what size of data, so perhaps efficiency isn't a concern for you. Readability is often more important for long-term maintenance.
What I am about to explain is kinda tricky, but I hope I can explain it clearly.
Suppose you have a function that does the Hodrick-Prescott detrending, that is, for a given λ it finds the trend τ minimizing
sum_t (y_t - τ_t)^2 + λ * sum_t ((τ_{t+1} - τ_t) - (τ_t - τ_{t-1}))^2
The user picks the λ value, so for every λ there exists a trend series τ(λ).
Suppose you pick a number V near 0 (on the positive side); for this case suppose V = 0.0001278846. Then you want to compute the ratio
F(λ) = sum_t (y_t - τ_t)^2 / sum_t (Δ²τ_t)^2
(I have the function that does this.) But you want to find a λ so that F(λ) = V. How can I accomplish this?
I tried to write a while statement but could not state the condition correctly, so I made a for loop with an if statement to break out of the loop when F(λ) - V = 0.
This is what my for loop looks like:
for (L in 1:3500) {
  F_ <- find_v(dataa, L)
  if ((F_ - V) == 0) {
    print(paste("The λ value for this series following Rule 1 is:", L))
    break
  }
  cat(paste("The λ =", L, "has a (F-V) difference of:", (F_ - V), "\n"))
}
where dataa is my data, composed of 89 observations.
Using this for loop I see that (F-V) turns negative between L = 3276 and L = 3277.
Is there a better way to do it? Like optimization?
Because using the for loop it feels like I'm obtaining the optimal λ by brute force.
Sorry for not including my data or the code for the Hodrick-Prescott detrending or the find_v function; they are far too long.
Since you are doing double optimization, consider the following:
The data
set.seed(0)
y <- rnorm(89)
The function to be optimized:
lfun <- function(tau, y, lambda) {
  n <- length(tau)
  # second differences of tau: tau[t+1] - 2*tau[t] + tau[t-1]
  tt <- tau[-(1:2)] - 2 * tau[-c(1, n)] + head(tau, -2)
  sum((y - tau)^2) + lambda * sum(tt^2)
}
The F function:
f_lambda <- function(lambda, y, V = 0) {
  tau <- optim(y, lfun, y = y, lambda = lambda, method = 'BFGS')$par
  tt  <- tail(tau, -2) - 2 * head(tau[-1], -1) + head(tau, -2)
  sqrt((sum((y - tau)^2) / sum(tt^2) - V)^2)  # |F(lambda) - V|
}
Optimizing the F function:
optim(0.1, f_lambda, y = y, V = 0.0001278846, method = "Brent", lower = 0, upper = 100)
$par
[1] 0.003412824
$value
[1] 2.633131e-10
$counts
function gradient
NA NA
$convergence
[1] 0
$message
NULL
Now lambda = 0.003412824 gives the desired V, i.e.:
f_lambda(0.003412824, y)
[1] 0.0001278843
Which is very close to the V=0.0001278846 you started with.
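Since F(λ) - V changes sign in λ (as the question's loop showed), plain root-finding is an alternative to minimizing the squared distance. A minimal sketch reusing lfun from above; the bracketing interval here is an assumption and may need widening:
F_raw <- function(lambda, y, V) {
  # same inner trend fit as in f_lambda, but returning the signed difference
  tau <- optim(y, lfun, y = y, lambda = lambda, method = 'BFGS')$par
  tt  <- tail(tau, -2) - 2 * head(tau[-1], -1) + head(tau, -2)
  sum((y - tau)^2) / sum(tt^2) - V
}
uniroot(F_raw, interval = c(1e-6, 100), y = y, V = 0.0001278846)$root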
I have the following exercise to be solved in R. Under the exercise, there is a hint towards the solution.
Exercise: If there are no ties in the data set, the function above will produce breakpoints with h observations in the interval between two consecutive breakpoints (except the last two perhaps). If there are ties, the function will by construction return unique breakpoints, but there may be more than h observations in some intervals.
Hint:
my_breaks <-function(x, h = 5) {
x <-sort(x)
breaks <- xb <- x[1]
k <- 1
for(i in seq_along(x)[-1])
{if(k<h)
{k <- k+1}
else{
if(xb<x[i-1]&&x[i-1]<x[i])
{xb <- x[i-1]
breaks <-c(breaks, xb)
k <- 1
}
}
}
However, I am having a hard time understanding the above function, particularly the following lines:
for(i in seq_along(x)[-1])
{if(k<h)
{k <- k+1}
Question:
How is the for loop supposed to act on k if k is previously defined as 1 and i is different from k? How are the breakpoints chosen according to the h = 5 gap if the for loop is not acting on x? Can someone explain how this function works?
Thanks in advance!
First, note that your example is incomplete: the return value and the final brace are missing. Here is the corrected version.
my_breaks <- function(x, h = 5) {
  x <- sort(x)
  breaks <- xb <- x[1]
  k <- 1
  for (i in seq_along(x)[-1]) {
    if (k < h) {
      k <- k + 1
    } else {
      if (xb < x[i-1] && x[i-1] < x[i]) {
        xb <- x[i-1]
        breaks <- c(breaks, xb)
        k <- 1
      }
    }
  }
  breaks
}
Let's check if it works.
my_breaks(c(1,1,1:5,8:10), 2)
#[1] 1 2 4 8
my_breaks(c(1,1,1:5,8:10), 5)
#[1] 1 3
As you can see, everything is fine. And what is seq_along(x)[-1]? We could write this expression as 2:length(x). So the for loop goes through each element of the vector x in sequence, skipping the first element. What is the k variable for? It counts how many observations have been seen since the last breakpoint, so that the h parameter is respected.
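A quick illustration of that indexing:
v <- c(4, 2, 7)
seq_along(v)      # [1] 1 2 3
seq_along(v)[-1]  # [1] 2 3  -- the same as 2:length(v)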
I am trying to write a function in R that generates n random variables from x using the sample() function, where x ~ Ge(p) (meaning x has a geometric distribution). In my function I would like to use a while loop. I think my function needs two inputs, size and p. I also need a for loop in my function. What I think will work is something like the framework below:
rGE <- function(size,p){
for
i<-1
while()
...
return(i)
}
I would like to develop the above function so that it generates n random variables from x (when x ~ Ge(p)).
For a home-grown, inefficient (but comprehensible) version of rgeom, something like this should work:
my_rgeom <- function(n, p) {
x <- numeric(n) ## allocate space for the results (all zeros)
for (i in seq(n)) {
done <- FALSE
while (!done) {
x[i] <- x[i] + 1
done <- runif(1)<p
}
}
return(x)
}
I'm sure you could use sample() instead of runif() for the innermost loop, but it's not obvious to me how. One piece of advice: if you're unfamiliar with programming, try writing your proposed algorithm out as pseudocode rather than jumping in to R-bashing right away. It can be easier if you deal with the logic and the coding nuts-and-bolts separately ...
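For what it's worth, a minimal sketch of one way sample() could stand in for runif() on the innermost draw (essentially what the question's own follow-up below does):
# a single Bernoulli(p) draw: TRUE with probability p
done <- sample(c(TRUE, FALSE), size = 1, prob = c(p, 1 - p))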
You could use rgeom:
set.seed(1)
rgeom(n = 10, p = .1)
#> [1] 6 3 23 3 24 13 15 2 20 3
I have finally written the function below:
rge <- function(n, p) {
  x <- numeric(n)
  for (i in seq(n)) {
    j <- 0
    while (j == 0) {
      x[i] <- x[i] + 1
      # a single Bernoulli(p) draw: 1 with probability p, 0 otherwise
      j <- sample(0:1, size = 1, prob = c(1 - p, p))
    }
  }
  return(x)
}
rge(10,.2)
I hope it really generates n random numbers from the geometric distribution.
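As a rough sanity check (note the convention: rge counts the trials up to and including the first success, so its mean should be near 1/p, while base R's rgeom counts the failures before the first success, with mean (1-p)/p):
set.seed(1)
mean(rge(1e4, 0.2))        # should be close to 1/0.2 = 5
mean(rgeom(1e4, 0.2)) + 1  # comparable after the off-by-one shift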
I'm trying to compute a kind of Gini index using a generated dataset.
But I ran into a problem with the last integrate call.
If I try to integrate the function named f1,
R says
Error in integrate(Q, 0, p) : length(upper) == 1 is not TRUE
My code is
# set up parameters b>a>1 and the number of observations n
n <- 1000
a <- 2
b <- 4
# generate x and y
# where x follows beta distribution
# y = 10x+3
x <- rbeta(n,a,b)
y <- 10*x+3
# the starting point of the integration having problem
Q <- function(q) {
  quantile(y, q)
}
# integrate the function Q from 0 to p
G <- function(p) {
  integrate(Q, 0, p)
}
# compute a function
L <- function(p) {
  numer <- G(p)$value
  dino <- G(1)$value
  numer / dino
}
# the part having problem
d <- 3
f1 <- function(p) {
  ((1 - p)^(d - 2)) * L(p)
}
integrate(f1, 0, 1) # In this integration, the aforementioned error appears
I think the nested integrate calls could be causing the problem, but I have no idea what the exact problem is.
Please help me!
As mentioned by @John Coleman, integrate needs a vectorized function and a properly chosen subdivisions option to carry out the integral. Even if you have already provided a vectorized function, it is sometimes tricky to set subdivisions appropriately in integrate(..., subdivisions = ).
To address your problem, I recommend integral from package pracma, where you still need a vectorized function for the integrand (see what I have done to functions G and L), but there is no need to set subdivisions manually, i.e.,
library(pracma)
# set up parameters b>a>1 and the number of observations n
n <- 1000
a <- 2
b <- 4
# generate x and y
# where x follows beta distribution
# y = 10x+3
x <- rbeta(n,a,b)
y <- 10*x+3
# the starting point of the integration having problem
Q <- function(q) {
  quantile(y, q)
}
# integrate the function Q from 0 to p
G <- function(p) {
  integral(Q, 0, p)
}
# compute a function
L <- function(p) {
  numer <- Vectorize(G)(p)
  dino <- G(1)
  numer / dino
}
# the part having problem
d <- 3
f1 <- function(p) {
  ((1 - p)^(d - 2)) * L(p)
}
res <- integral(f1, 0, 1)
then you will get
> res
[1] 0.1283569
The error that you reported is due to the fact that the function passed to integrate must be vectorized, while integrate itself isn't vectorized: its upper limit must be a single value, but your f1 hands it the whole vector p.
From the help (?integrate):
f must accept a vector of inputs and produce a vector of function
evaluations at those points. The Vectorize function may be helpful to
convert f to this form.
Thus one "fix" is to replace your definition of f1 by:
f1 <- Vectorize(function(p) {
((1-p)^(d-2))*L(p)
})
But when I run the resulting code I always get:
Error in integrate(Q, 0, p) : maximum number of subdivisions reached
A solution might be to tabulate a large number of quantiles, smooth them out, and use that rather than your Q, although the error here strikes me as odd.
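A minimal sketch of that idea, assuming the same y, d, and f1 setup as in the question (the grid size and the helper names Qs, G2, L2, f2 are illustrative): tabulate the quantiles once on a fine grid and interpolate, which gives a cheap, vectorized stand-in for Q.
qgrid <- seq(0, 1, length.out = 1001)
Qs <- approxfun(qgrid, quantile(y, qgrid))  # interpolated quantile function
G2 <- function(p) integrate(Qs, 0, p)$value
L2 <- function(p) sapply(p, G2) / G2(1)     # vectorized over p
f2 <- function(p) ((1 - p)^(d - 2)) * L2(p)
integrate(f2, 0, 1)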
I want to use arms() to get one sample each time and make a loop like the following one in my function. It runs very slowly. How could I make it run faster? Thanks.
library(HI)
dmat <- matrix(0, nrow = 100, ncol = 30)
system.time(
  for (d in 1:100) {
    for (j in 1:30) {
      y <- rep(0, 101)
      for (i in 2:100) {
        y[i] <- arms(0.3, function(x) (3.5 + 0.000001 * d * j * y[i-1]) * log(x) - x,
                     function(x) (x > 1e-4) * (x < 20), 1)
      }
      dmat[d, j] <- sum(y)
    }
  }
)
This is a version based on Tommy's answer but avoiding all loops:
library(multicore) # or library(parallel) in 2.14.x
set.seed(42)
m = 100
n = 30
system.time({
  arms.C <- getNativeSymbolInfo("arms")$address
  bounds <- 0.3 + convex.bounds(0.3, dir = 1, function(x) (x > 1e-4) * (x < 20))
  if (diff(bounds) < 1e-07) stop("pointless!")
  # create the vector of z values
  zval <- 0.00001 * rep(seq.int(n), m) * rep(seq.int(m), each = n)
  # apply the inner function to each grid point and return the matrix
  dmat <- matrix(unlist(mclapply(zval, function(z)
    sum(unlist(lapply(seq.int(100), function(i)
      .Call(arms.C, bounds, function(x) (3.5 + z * i) * log(x) - x,
            0.3, 1L, parent.frame())
    )))
  )), m, byrow = TRUE)
})
On a multicore machine this will be really fast since it spreads the loads across cores. On a single-core machine (or for poor Windows users) you can replace mclapply above with lapply and get only a slight speedup compared to Tommy's answer. But note that the result will be different for parallel versions since it will use different RNG sequences.
Note that any C code that needs to evaluate R functions will be inherently slow (because interpreted code is slow). I have added the arms.C just to remove all R->C overhead to make moli happy ;), but it doesn't make any difference.
You could squeeze out a few more milliseconds by using column-major processing (the question code was row-major which requires re-copying as R matrices are always column-major).
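For illustration, matrix() fills column by column unless told otherwise, which is why producing the results in row order forces the byrow = TRUE re-copy:
matrix(1:6, nrow = 2)                # filled down the columns
matrix(1:6, nrow = 2, byrow = TRUE)  # row-major fill, re-copied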
Edit: I noticed that moli changed the question slightly after Tommy answered, so instead of the sum(...) part you have to use a loop, since the y[i] are dependent. The function(z) would then look like:
function(z) {
  y <- 0
  for (i in seq.int(99))
    y <- y + .Call(arms.C, bounds, function(x) (3.5 + z * y) * log(x) - x,
                   0.3, 1L, parent.frame())
  y
}
Well, one effective way is to get rid of the overhead inside arms. It does some checks and calls the indFunc every time even though the result is always the same in your case.
Some other evaluations can also be done outside the loop. These optimizations bring the time down from 54 secs to around 6.3 secs on my machine. ...and the answer is identical.
set.seed(42)
#dmat2 <- ##RUN ORIGINAL CODE HERE##
# Now try this:
set.seed(42)
dmat <- matrix(0, nrow=100,ncol=30)
system.time({
  e <- new.env()
  bounds <- 0.3 + convex.bounds(0.3, dir = 1, function(x) (x > 1e-4) * (x < 20))
  f <- function(x) (3.5 + z * i) * log(x) - x
  if (diff(bounds) < 1e-07) stop("pointless!")
  for (d in seq_len(nrow(dmat))) {
    for (j in seq_len(ncol(dmat))) {
      y <- 0
      z <- 0.00001 * d * j
      for (i in 1:100) {
        y <- y + .Call("arms", bounds, f, 0.3, 1L, e)
      }
      dmat[d, j] <- y
    }
  }
})
all.equal(dmat, dmat2) # TRUE
Why not like this?
dat <- expand.grid(d=1:10, j=1:3, i=1:10)
arms.func <- function(vec) {
  require(HI)
  dji <- vec[1] * vec[2] * vec[3]
  arms.out <- arms(0.3,
                   function(x, params) (3.5 + 0.00001 * params) * log(x) - x,
                   function(x, params) (x > 1e-4) * (x < 20),
                   n.sample = 1,
                   params = dji)
  return(arms.out)
}
dat$arms <- apply(dat,1,arms.func)
library(plyr)
out <- ddply(dat,.(d,j),summarise, arms=sum(arms))
matrix(out$arms,nrow=length(unique(out$d)),ncol=length(unique(out$j)))
However, it's still single-core and time-consuming. But that isn't R being slow; it's the arms function.