Perform transformation on vector with if-condition in R

I am running regression models in R from a PHP script. Specifically, I am performing a regression...
fit <- lm(VIEWS ~ TREND + STUNDE + WOCHENTAG + MONAT * JAHR)
...then apply a Box-Cox transformation to the result, fetching the "best" lambda inside a range that makes sense in my case.
bc <- boxcox(fit, lambda = seq(0, 0.5, 0.005))
lambda <- bc$x[which.max(bc$y)]
If lambda is not 0, I apply the actual transformation and everything works fine:
VIEWS_BOX <- VIEWS^lambda - 1
I repeat the regression by replacing VIEWS by VIEWS_BOX.
My problem:
I need to write an if-condition for the case that lambda = 0. I have read a lot and tried many things, but it just does not work. For example, I tried:
VIEWS_BOX <- ifelse(rep(lambda = 0, length(VIEWS)), ln(VIEWS), VIEWS^lambda - 1)
Can anyone help me? Thanks in advance!

In your code, you used lambda = 0, but = means assignment (or, inside a function call, names an argument); for a comparison you need ==, i.e. lambda == 0. Also note that R's natural logarithm is log(), not ln().
ifelse is not necessary here, since lambda is a single value. You can use:
VIEWS_BOX <- if (lambda) VIEWS^lambda - 1 else log(VIEWS)
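Putting it together, a minimal sketch of the full workflow might look like this (assuming VIEWS, TREND, STUNDE, WOCHENTAG, MONAT and JAHR already exist in your workspace, as in the question; boxcox() comes from the MASS package):
library(MASS)  # provides boxcox()

fit <- lm(VIEWS ~ TREND + STUNDE + WOCHENTAG + MONAT * JAHR)
bc <- boxcox(fit, lambda = seq(0, 0.5, 0.005))
lambda <- bc$x[which.max(bc$y)]

# log-transform when lambda is 0, power transform otherwise
VIEWS_BOX <- if (lambda) VIEWS^lambda - 1 else log(VIEWS)

# repeat the regression on the transformed response
fit_box <- lm(VIEWS_BOX ~ TREND + STUNDE + WOCHENTAG + MONAT * JAHR)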

Related

Solving a function with an iterative method

I am trying to solve a function whose values are implicitly given within two equations. I have to use an iterative method.
Values that are already given are V_DCL, barrier, us_r, time.
My code for the function is:
Test <- function(V_DCL, barrier, us_r, time, vola_A) {
  V_A <- V_DCL/(pnorm(d_1, 0, 1)) + (barrier*exp(us_r*time)*pnorm(d_2, 0, 1))/(pnorm(d_1, 0, 1))
  d_1 <- (log(V_A/barrier) + (us_r + 0.5*vola_A^2)*time)/(vola_A*sqrt(time))
  d_2 <- d_1 - vola_A*sqrt(time)
  outputs <- list(V_A, d_1, d_2, vola_A)
  return(outputs)
}
I need to get a value for V_A and vola_A.
This function is similar to the Black-Scholes formula, but I do not have a value for the assets, only for the liabilities, so I rearranged it.
So far I know that I have to make an initial guess for vola_A, which then needs to be adjusted until all equations fit.
I already looked into the base repeat construct and into the simecol package, but I did not figure out how to apply either to my code.
Can you give me some ideas? Thank you.
Edit (additional information):
The given value for us_r is 0.05; time stands for the time horizon, and I will work with 1 period for now. The given value for barrier is 2.683782e+13 and for V_DCL it is 4.732741e+11.
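One common way to attack such implicitly defined equations is a plain fixed-point iteration: guess V_A, recompute d_1, d_2 and a new V_A from the equations exactly as written above, and stop when successive values agree. A minimal sketch of that inner step for a given vola_A (an assumption on my part, not a verified solution; the starting values are arbitrary):
solve_V_A <- function(V_DCL, barrier, us_r, time, vola_A, tol = 1e-8, max_iter = 1000) {
  V_A <- V_DCL + barrier  # crude starting guess for the asset value
  for (k in seq_len(max_iter)) {
    d_1 <- (log(V_A/barrier) + (us_r + 0.5*vola_A^2)*time)/(vola_A*sqrt(time))
    d_2 <- d_1 - vola_A*sqrt(time)
    V_A_new <- V_DCL/pnorm(d_1) + (barrier*exp(us_r*time)*pnorm(d_2))/pnorm(d_1)
    if (abs(V_A_new - V_A) < tol * abs(V_A)) break  # stop once successive values agree
    V_A <- V_A_new
  }
  list(V_A = V_A, d_1 = d_1, d_2 = d_2, vola_A = vola_A)
}

# with the values from the edit and an arbitrary volatility guess of 0.2
solve_V_A(V_DCL = 4.732741e+11, barrier = 2.683782e+13, us_r = 0.05, time = 1, vola_A = 0.2)
Since the second relation tying vola_A to the other quantities is not shown in the question, an outer loop (or a root-finder such as uniroot) over vola_A would have to wrap around this inner iteration.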

Nested integration for incomplete convolution of Gaussian densities

Let g(x) = 1/sqrt(2*pi) * exp(-x^2 / 2) be the density of the normal distribution with mean 0 and standard deviation 1. In some calculation on paper, integrals of the following form appeared: f_1(x) = g(x) and, for n > 1, f_n(x) = integral from -c to c of g(x - z) * f_{n-1}(z) dz, where c > 0 is a positive number.
Since I could not evaluate this by hand, I had the idea to approximate it numerically and plot it. I tried this in R, because R provides the dnorm function and the integrate function for numerical integration.
You can see that I need to integrate numerically n times, where n is to be chosen when calling a plot function. My code uses a for-loop to create those "incomplete" convolutions iteratively.
For example, even with n = 3 and c = 1 this gives me an error; n = 2 (i.e. a single integration) works.
N = 3
ngauss <- function(x) dnorm(x , mean = 0, sd = 1)
convoluts <- list()
convoluts[[1]] <- ngauss
for (i in 2:N) {
  h <- function(y) {
    g <- function(z) {ngauss(y-z)*convoluts[[i-1]](z)}
    return(integrate(g, lower = -1, upper = 1)$value)
  }
  h <- Vectorize(h)
  convoluts[[i]] <- h
}
convoluts[[3]](0)
What I get is:
Error: evaluation nested too deeply: infinite recursion / options(expressions=)?
I understand that this is a hard computation, but for "small" n something like this should be possible.
Maybe someone can help me fix my code, or recommend a better way to implement this. Another language that is more appropriate for this would also be okay.
The issue is how the functions built in the loop capture the variable i: they all refer to the single i in the enclosing environment, and after the loop has finished that i holds its final value, so convoluts[[i]] ends up calling itself and recursing forever. Instead, freezing the current value of i in each iteration with
h <- evalq(function(y) {
  g <- function(z) {ngauss(y - z) * convoluts[[i - 1]](z)}
  integrate(g, lower = -1, upper = 1)$value
}, list(i = i))
does the job and, say, setting N <- 6 quickly gives
convoluts[[N]](0)
# [1] 0.03423872
As your integration (once the limits are infinite) is simply the pdf of a sum of N independent standard normals, which follows N(0, N), we may also verify this approach by setting lower = -Inf and upper = Inf. Then with N <- 4 we have
dnorm(0, sd = sqrt(N))
# [1] 0.1994711
convoluts[[N]](0)
# [1] 0.1994711
So, for practical purposes, when c = Inf, you are way better off using dnorm rather than manual computations.
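For reference, here is the whole corrected loop assembled in one place (the same code as above with the fix dropped in; c = 1, i.e. the finite bounds from the question):
N <- 4
ngauss <- function(x) dnorm(x, mean = 0, sd = 1)
convoluts <- list()
convoluts[[1]] <- ngauss
for (i in 2:N) {
  # evalq fixes the current value of i in the environment of h
  h <- evalq(function(y) {
    g <- function(z) ngauss(y - z) * convoluts[[i - 1]](z)
    integrate(g, lower = -1, upper = 1)$value
  }, list(i = i))
  convoluts[[i]] <- Vectorize(h)
}
convoluts[[N]](0)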

Two-variable function maximization - R code

So I'm trying to maximize the likelihood function for a Gamma-Poisson model, and I've programmed it in R as follows:
lik <- function(x, t, a, b) {
  n <- length(x)
  like <- numeric(n)
  for (i in 1:n) {
    like[i] <- log(gamma(a + x[i])) - log(gamma(a)) -
      log(gamma(1 + x[i])) + x[i]*log(t[i]/b) -
      (a + x[i])*log(1 + t[i]/b)
  }
  return(sum(like))
}
where x and t are the data, and I have n data rows.
I need a and b to be solved for simultaneously. Does a built-in function exist in R, or do I need to hand-code an algorithm to solve the system of equations? [I'd rather not.] I know optimize() solves for one variable, and so does fminbnd(). I'm trying to reproduce the behavior of FindMaximum() in Mathematica. In a perfect world, I'd like the code to work something like this:
optimize(f=lik, a>0, b>0, x=x, t=t, maximum=TRUE, iteration=5000)
$maximum
a 150
b 6
Thanks.
optim's first argument can be a vector of parameters. So you could try something like this:
lik <- function(p=c(1,1), x, t){
# In the body of the function replace a by p[1] and b by p[2]
}
optim(c(1,1), lik, method = c("L-BFGS-B"), x=x, t=t, control=list(fnscale=-1))
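For concreteness, one way the body could be filled in (a sketch; it assumes x and t hold your data, and it uses lgamma(z), which computes log(gamma(z)) directly and so does not overflow to Inf for large parameter values the way gamma() does):
lik <- function(p = c(1, 1), x, t) {
  a <- p[1]  # first element of the parameter vector
  b <- p[2]  # second element of the parameter vector
  sum(lgamma(a + x) - lgamma(a) - lgamma(1 + x) +
      x * log(t / b) - (a + x) * log(1 + t / b))
}
optim(c(1, 1), lik, x = x, t = t, method = "L-BFGS-B",
      lower = c(1e-6, 1e-6), control = list(fnscale = -1))
The lower bounds keep a and b strictly positive, which L-BFGS-B supports through box constraints.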
So the solution that ended up working out is:
attempt2d <- optim(
par = c(sumx/sumt, 1), fn = lik, data = data11,
method = "L-BFGS-B", control = list(fnscale = -1, trace=TRUE),
lower=0.1, upper = 170
)
However, my parameters run up to the bound of 170, essentially meaning that my gamma values are Inf, because gamma() overflows relatively quickly. In Mathematica the solutions are a = 169 and b = 16505, and R gets nowhere near that, maxing out at 170. The known solutions are beyond 170 in some cases; is there any solution for this anomaly?

Generate some simple dummy data in R

I just want some random data to experiment with different prediction models.
My code:
x <- 0
for (i in 1:200) {
  num <- runif(1, 0, 500)
  neg <- round(runif(5, -1, 0))
  percent <- ((0.01 * runif(1, 1, 10)) * num)
  x[i] = num + (neg * percent)
}
The idea is that this should generate 200 points.
num is a random number between 0 and 500
neg is either -1 or 1, just to add some flexibility to the random offset (negative or positive offset of a randomly generated point)
percent is just a random percentage between 1% and 10% of the originally generated random number to either be added or subtracted
Very similar code that I've made in my main language, C#, works very well and generates proper plots. I'm more-or-less trying to port that code.
Whenever I run the above, I get the following errors (a lot of them):
number of items to replace is not a multiple of replacement length
It's triggered on the last line of code in the for loop.
I'd love to be able to fix this. Any help is appreciated. Thank you!
Chrisss has already pointed out your problem in his comment. However, you're doing a lot of things that are bad practice from an R programming perspective. The following approach is better:
N <- 200
d <- data.frame(x = rep(NA, N))
num <- runif(N, 0, 500)
neg <- sample(c(1,-1), 200, replace = TRUE) #jrdnmdhl pointed this out in his post
percent <- ((0.01 * runif(N, 1, 10)) * num)
d$x <- num + (neg * percent)
Why is this better? Two reasons. First, we avoid a for loop: R is a high-level language, and explicit loops are slow. Second, the original code does not preallocate memory; skipping that step slows things down as well, because R has to find more memory for x on every iteration.
A great resource for learning more about both points is Hadley Wickham's Advanced R.
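To see the preallocation point in isolation, here is a tiny timing sketch (my own illustration; absolute numbers will vary by machine and R version):
n <- 1e5
system.time({x <- numeric(0); for (i in 1:n) x[i] <- i})  # grown one element at a time
system.time({x <- numeric(n); for (i in 1:n) x[i] <- i})  # preallocated up front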
The commenter mentioned the main problem, but your code would be much faster if vectorized. Also, your description of 'neg' is not consistent with what it is doing. Your code doesn't generate either -1 or 1. Instead, it generates either -1 or 0. The code below will generate either -1 or 1 for the neg variable.
num = runif(200, 0, 500)
neg = sample(c(1,-1),200,replace=T)
percent = ((0.01 * runif(200, 1, 10)) * num)
x = num + (neg * percent)
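As a quick sanity check of the generated data (a usage sketch, not part of either answer):
summary(x)  # offsets are at most 10%, so values should stay roughly between 0 and 550
plot(x, main = "200 simulated points with a random +/- 1-10% offset")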

Optimizing the code for error minimization

I have written the code below to minimize the error by changing the value of alpha (using an iterative method).
set.seed(16)
npoints = 10000
Y = round(runif(npoints), 3)
OY = sample(c(0, 1, 0.5), npoints, replace = T)
minimizeAlpha = function(Y, OY, alpha) {
  PY = alpha*Y
  error = OY - PY
  squaredError = sapply(error, function(x) x*x)
  sse = sum(squaredError)
  return(sse)
}
# Iterate over 10000 values
alphas = seq(0.0001, 1, 0.0001)
sse = sapply(alphas, function(x) minimizeAlpha(Y, OY, x))
print(alphas[sse == min(sse)])
I have used sapply for this basic optimization. But if the number of points is more than 10000, this code runs forever. So, is there any better way to implement this, or any standard optimization technique (like bisection)? If so, can you please help me optimize the code?
Note: I need the value of alpha to at least 4 decimal places.
Any help is appreciated.
Replacing a for loop with sapply isn't more efficient; that's a misconception. It merely often gives simpler code.
However, you can actually take advantage of vectorisation in your code, and that would be faster.
For instance, sapply(error, function(x) x*x) can simply be replaced by error * error. The sum of squared errors is thus simply sum((OY - PY) ** 2).
Your whole function thus boils down to:
minimizeAlpha = function(Y, OY, alpha)
sum((OY - alpha * Y) ** 2)
This should be more efficient — but first and foremost it’s better code and more readable.
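As a usage sketch (my own addition, not part of the answer above): with the vectorised function, the original grid search works unchanged, and since the SSE is a smooth function of a single alpha, R's built-in optimize() (golden-section search, one of the standard one-dimensional techniques the question alludes to) finds the minimum without a grid at all:
minimizeAlpha <- function(Y, OY, alpha) sum((OY - alpha * Y) ** 2)  # definition from above

# grid search over 10000 candidate values, as in the question
alphas <- seq(0.0001, 1, 0.0001)
sse <- sapply(alphas, function(a) minimizeAlpha(Y, OY, a))
alphas[which.min(sse)]

# grid-free alternative: one-dimensional minimisation on the same interval
optimize(function(a) minimizeAlpha(Y, OY, a), interval = c(0.0001, 1), tol = 1e-6)$minimum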
