Geometric mean functions returning Inf - R

Trying to solve a homework problem:
I have two functions to compute the geometric mean of 1000 observations from the exponential distribution with a rate of 0.01. The following keeps returning Inf:
gmean <- function(n)
{
  prod(n)^(1/length(n))
}
x<-rexp(1000,1/100)
gmean(x)
but this one does not:
gmean1 <- function(n)
{
  exp(mean(log(n)))
}
x<-rexp(1000,1/100)
gmean1(x)
Why is this? I think it's something to do with the prod function but I'm not sure.

The problem is that prod(n) computes the entire product before it is raised to the power of 1/length(n). Since the mean of x is about 100, you can expect that product to have roughly the same order of magnitude as 100^1000, which is far beyond the largest number R can represent (anything above about 1.8 * 10^308 becomes Inf).
Any mathematical operation you attempt on Inf will also return Inf, so the naive implementation stops working once x contains more than about 154 values:
100^154
#> [1] 1e+308
100^155
#> [1] Inf
In actuality, the running product grows roughly like the geometric mean of the sample raised to its length, and the geometric mean here is only about 56 (well below the arithmetic mean of 100), so you can typically get to a length of around 180 before the product overflows to Inf.
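A quick way to see where the overflow kicks in for a particular sample (this check is not part of the original answer) is to look at the running product:
set.seed(42)                      # any seed; the exact index varies from sample to sample
x <- rexp(1000, 1/100)
which(!is.finite(cumprod(x)))[1]  # first position at which the running product hits Inf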
In any case, it would be safer to stick to
gmean <- function(n) exp(sum(log(n))/length(n))
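As a sanity check (not part of the original answer), the prod-based and log-based versions agree whenever the sample is small enough for the product to stay finite:
set.seed(1)
small <- rexp(50, 1/100)
prod(small)^(1/length(small))       # prod-based version, still finite at n = 50
exp(sum(log(small))/length(small))  # log-based version, same value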

Related

Trying to plot loglikelihood of Cauchy distribution for different values of theta in R

I am trying to plot the log-likelihood function of the Cauchy distribution for varying values of theta (location parameter). These are my observations:
obs<-c(1.77,-0.23,2.76,3.80,3.47,56.75,-1.34,4.24,3.29,3.71,-2.40,4.53,-0.07,-1.05,-13.87,-2.53,-1.74,0.27,43.21)
Here is my log-likelihood function:
ll_c <- function(theta, x_values){
  n <- length(x_values)
  logl <- -n*log(pi) - sum(log(1 + (x_values - theta)^2))
  return(logl)
}
and I've tried making a plot using this code:
x <- seq(from = -10, to = 10, by = 0.1); length(x)
theta_null <- NULL
for (i in x){
  theta_log <- ll_c(i, counts)
  theta_null <- c(theta_null, theta_log)
}
plot(theta_null)
The graph does not look right and for some reason the length of x and theta_null differs.
I am assuming that theta is your location parameter (the scale is set to 1 in my example). You should obtain the same result using a t-distribution with 1 df and shifting the observations by theta. I left some comments in the code as guidance.
obs <- c(1.77, -0.23, 2.76, 3.80, 3.47, 56.75, -1.34, 4.24, 3.29, 3.71, -2.40, 4.53, -0.07, -1.05, -13.87, -2.53, -1.74, 0.27, 43.21)
ll_c <- function(theta, obs)
{
  # Compute the log-likelihood of obs for a given value of theta (location)
  logl <- sum(dcauchy(obs, location = theta, scale = 1, log = TRUE))
  return(logl)
}
# Loop for possible values of theta (obs given)
x <- seq(from = -10, to = 10, by = 0.1)
ll <- NULL
for (i in x)
{
  ll <- c(ll, ll_c(i, obs))
}
# Plot log-lik vs possible values of theta
plot(x, ll)
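A more compact equivalent (a sketch, not part of the original answer) replaces the explicit loop with sapply and draws the curve as a line:
ll <- sapply(x, ll_c, obs = obs)
plot(x, ll, type = "l", xlab = "theta", ylab = "log-likelihood")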
It is hard to say exactly what you are experiencing without more info. But I'll make an educated guess.
First of all, we can simplify this a lot by using the *t family of functions for the t distribution, as the Cauchy distribution is just the t distribution with df = 1. So your calculations could have been done using
for (i in ncp)
  theta_null <- c(theta_null, sum(dt(values, 1, i, log = TRUE)))
Note that the constant term involving n does not actually matter for any practical purpose: we are usually interested in minimizing/maximizing the log-likelihood, and additive constants do not change where the optimum lies.
Now if we use this approach, we can quite quickly notice something by printing the values:
print(head(theta_null))
[1] -Inf -Inf -Inf -Inf -Inf -Inf
So I am assuming what you are experiencing is that many of your values are "almost" negative infinity, and maybe these are not stored correctly in your outcome vector. I can't see that this should be the case from your code, but this would be my initial guess.

R median and ecdf() function giving different results - Why?

I have a vector vec with 80 values. If I apply the median(vec) function I get a value. However, what I would like to do is the reverse: given a number, estimate the percentile it belongs to. I've found the ecdf() function, however I'm getting different results. This is a simplified example:
> vec = c(100,150,150,150,150,150,200)
> median(vec)
# This gives the expected result
[1] 150
# However if I go the other way around, meaning I pass the value and try to return the percentile I get:
rev_med <- ecdf(vec)
rev_med(150)
[1] 0.8571429
!!!
The behavior I'm expecting is to pass 150 and get back 50%, as this is the median of the vector.
What's going wrong here?
ecdf is giving the empirical CDF, which is a function F for which F(x) = P[X <= x] where X is the random variable producing the input vector vec.
It's an estimator; median is a different estimator.
But you can see that ecdf gives a reasonable answer: six of the seven values are <= 150, so F(150) = 6/7:
mean(vec <= 150)
# [1] 0.8571429
Nevertheless, we can use the ecdf object to produce 150 as the median:
quantile(ecdf(vec), .5)
# 50%
# 150
See ?ecdf; this isn't a complete answer but hopefully it's illuminating anyway.
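If you specifically want 0.5 back at the median, one common convention (not part of the original answer) is the mid-distribution CDF, which counts ties as one half:
mid_ecdf <- function(vec, q) mean(vec < q) + 0.5 * mean(vec == q)
mid_ecdf(vec, 150)
# [1] 0.5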

R histogram breaks Error

I have to prepare an algorithm for my thesis to cross-check a theoretical result, namely that the binomial model for N periods converges to the lognormal distribution as N goes to infinity. For those of you not familiar with the concept, I have to create an algorithm that takes a starting value and multiplies it by an up-multiplier and a down-multiplier, doing so for every value over N steps. The algorithm should return a vector whose elements are of the form S*u^i*d^(N-i) for i = 0, ..., N.
The simple algorithm I proposed is
rata <- function(N, r, u, d, S){
  length(x) <- N
  for(i in 0:N){
    x[i] <- S*u^{i}*d^{N-i}
  }
  return(x)
}
N is the number of periods and the rest are just unimportant values (u is the up-multiplier, d the down-multiplier, etc.).
In order to extract my results I need to make a histogram of the logarithm of the produced vector to show that the values are normally distributed. However, for N = 100000 (I need a great number of steps to demonstrate convergence), when I type hist(x) I get the error: invalid number of breaks.
Can anyone help? Thanks in advance.
An example
taf<-rata(100000,1,1.1,0.9,1)
taf1<-log(taf)
hist(taf1,xlim=c(-400,400))
First I fix your function:
rata <- function(N, r, u, d, S){
  x <- numeric(N + 1)
  for(i in 0:N){
    x[i] <- S * u^i * d^(N - i)
  }
  return(x)
}
Or relying on vectorization:
rata <- function(N, r, u, d, S){
  x <- S * u^(0:N) * d^(N - (0:N))
  return(x)
}
taf<-rata(100000,1,1.1,0.9,1)
Looking at the result, we notice that it contains NaN values:
taf[7440 + 7:8]
#[1] 0 NaN
What happened? Apparently the multiplication became NaN:
1.1^7448*0.9^(1e5-7448)
#[1] NaN
1.1^7448
#[1] Inf
0.9^(1e5-7448)
#[1] 0
Inf * 0
#[1] NaN
Why does an Inf value occur? Well, because of double overflow (read help("double")):
1.1^(7440 + 7:8)
#[1] 1.783719e+308 Inf
You have the mirror-image problem, underflow to zero, when a multiplicand gets very close to 0 (read help(".Machine")).
You may need to use arbitrary precision numbers.
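Alternatively, since the goal is a histogram of the logarithms anyway, a sketch (not part of the original answer) that works directly on the log scale sidesteps both overflow and underflow:
log_rata <- function(N, u, d, S){
  i <- 0:N
  log(S) + i * log(u) + (N - i) * log(d)  # log of S * u^i * d^(N - i)
}
taf1 <- log_rata(100000, 1.1, 0.9, 1)
hist(taf1)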

R minimize absolute error

Here's my setup
obs1<-c(1,1,1)
obs2<-c(0,1,2)
obs3<-c(0,0,3)
absoluteError<-function(obs,x){
return(sum(abs(obs-x)))
}
Example:
> absoluteError(obs2,1)
[1] 2
For a random vector of observations, I'd like to find a minimizer, x, which minimizes the absolute error between the observation values and a vector of all x. For instance, clearly the argument that minimizes absoluteError(obs1,x) is x=1 because this results in an error of 0. How do I find a minimizer for a random vector of observations? I'd imagine this is a linear programming problem, but I've never implemented one in R before.
The median of obs is a minimizer for the absolute error. The following is a sketch of how one might try proving this:
Let the median of a set of n observations, obs, be m. Call the absolute error between obs and m f(obs,m).
Case n is odd:
Consider f(obs, m + delta) where delta is some non-zero number. Suppose delta is positive: then there are (n-1)/2 + 1 observations (those at or below m) whose individual error increases by delta, while each of the remaining (n-1)/2 observations' errors decreases by at most delta. So f(obs, m + delta) - f(obs, m) >= delta > 0. The same argument can be made if delta is negative, so f(obs, m + delta) > f(obs, m) for any non-zero delta, and the median is the unique minimizer in this case.
Case n is even:
Basically the same logic as above, except in this case any number between (and including) the two innermost numbers in the set is a minimizer.
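A quick numerical check (a sketch, not part of the original answer) using the absoluteError function from the question confirms this for obs3:
absoluteError <- function(obs, x) sum(abs(obs - x))
obs3 <- c(0, 0, 3)
candidates <- seq(-1, 4, by = 0.01)
errors <- sapply(candidates, absoluteError, obs = obs3)
candidates[which.min(errors)]  # 0, which equals median(obs3)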
I am not sure this answer is correct, and even if it is I am not sure this is what you want. Nevertheless, I am taking a stab at it.
I think you are talking about 'Least absolute deviations', a form of regression that differs from 'Least Squares'.
If so, I found this R code for solving Least absolute deviations regression:
fabs=function(beta0,x,y){
  b0=beta0[1]
  b1=beta0[2]
  n=length(x)
  llh=0
  for(i in 1:n){
    r2=(y[i]-b0-b1*x[i])
    llh=llh + abs(r2)
  }
  llh
}
g=optim(c(1,1),fabs,x=x,y=y)
I found the code here:
http://www.stat.colostate.edu/~meyer/hw12ans.pdf
Assuming you are talking about Least absolute deviations, you might not be interested in the above code if you want a solution in R from scratch rather than a solution that uses optim.
The above code is for a regression line with an intercept and one slope. I modified the code as follows to handle a regression with just an intercept:
y <- c(1,1,1)
x <- 1:length(y)
fabs=function(beta0,x,y){
  b0=beta0[1]
  b1=0
  n=length(x)
  llh=0
  for(i in 1:n){
    r2=(y[i]-b0-b1*x[i])
    llh=llh + abs(r2)
  }
  llh
}
# The commands to get the estimator
g = optim(c(1),fabs,x=x,y=y, method='Brent', lower = (min(y)-5), upper = (max(y)+5))
g
I was not familiar with (i.e., had not heard of) Least absolute deviations until tonight. So, hopefully my modifications are fairly reasonable.
With y <- c(1,1,1) the parameter estimate is 1 (which I think you said is the correct answer):
$par
[1] 1
$value
[1] 1.332268e-15
$counts
function gradient
NA NA
$convergence
[1] 0
$message
NULL
With y <- c(0,1,2) the parameter estimate is 1:
$par
[1] 1
$value
[1] 2
$counts
function gradient
NA NA
$convergence
[1] 0
$message
NULL
With y <- c(0,0,3) the parameter estimate is 0 (which you said is the correct answer):
$par
[1] 8.613159e-10
$value
[1] 3
$counts
function gradient
NA NA
$convergence
[1] 0
$message
NULL
If you want R code from scratch, there is additional R code in the file at the link above which might be helpful.
Alternatively, perhaps it might be possible to extract the relevant code from the source file.
Alternatively, perhaps someone else can provide the desired code (and correct any errors on my part) in the next 24 hours.
If you come up with code from scratch please post it as an answer as I would love to see it myself.
lad <- function(x, y){
  SAD <- function(beta, x, y) {
    return(sum(abs(y - (beta[1] + beta[2] * x))))
  }
  d <- lm(y ~ x)
  ans1 <- optim(par = c(d$coefficients[1], d$coefficients[2]), method = "Nelder-Mead", fn = SAD, x = x, y = y)
  coe <- setNames(ans1$par, c("(Intercept)", substitute(x)))
  fitted <- setNames(ans1$par[1] + ans1$par[2] * x, c(1:length(x)))
  res <- setNames(y - fitted, c(1:length(x)))
  results <- list(coefficients = coe, fitted.values = fitted, residuals = res)
  class(results) <- "lad"
  return(results)
}
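A possible usage example (a sketch, not part of the original answer; xv and yv are made-up data), comparing the result with an ordinary least-squares fit:
set.seed(1)
xv <- 1:20
yv <- 2 + 0.5 * xv + rnorm(20)
fit <- lad(xv, yv)
fit$coefficients   # LAD intercept and slope
coef(lm(yv ~ xv))  # least-squares coefficients for comparison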

Fastest way to sample real values using a proportional probability

Given a numeric vector with N real numbers, what's the fastest way to sample k values, such that higher values have greater probability of being selected?
mathematically
prob(X) > prob(Y) when X > Y (Linearly)
This is easy with sample() when all entries are positive, just use the prob arg:
N = 1000
k = 600
x = runif(N, 0, 10)
results = sample(x, k, replace = TRUE, prob = x)
But it doesn't work in my case, because some values might be negative. I cannot drop or ignore the negative values; that's the problem.
So, what's the fastest (in terms of code speed) way of doing this? Obviously I know how to solve it; the issue is speed, and I guess one method should be slower than the other:
1 - Normalize the x vector (a call to `range()` would be necessary, plus a division)
2 - Add max(x) to x (a call to `max()`, then an addition)
Thanks.
A few comments. First, it's still not exactly clear what you want. Obviously, you want larger numbers to be chosen with higher probability, but there are a lot of ways of doing this. For example, either rank(x) or x-min(x) will produce a vector of non-negative weights which are monotonic in x.
Another point: you don't need to normalize the weights, because sample will do that for you, provided the weights are non-negative:
> set.seed(1)
> sample(1:10,prob=1:10)
[1] 9 8 6 2 10 3 1 5 7 4
> set.seed(1)
> sample(1:10,prob=(1:10)/sum(1:10))
[1] 9 8 6 2 10 3 1 5 7 4
On edit: The OP is now asking for a weighting function which is "linear" in the input vector. Technically this is impossible, because linear functions are of the form f(X)=cX, so if a vector x contains both positive and negative values, then any linear function of x will also contain both positive and negative values, unless c=0, in which case it still does not give a valid vector of probability weights.
I think what you mean by "linear" is simply x-min(x). This is not a linear function, but an affine function. Moreover, even if you had specified that you wanted P(X) to vary as an affine function of X, that still would not have uniquely determined the probability weights, because there are an infinite number of possible affine functions that would yield valid weights (e.g. x-min(x)+1, etc.)
In any case, assuming x-min(x) is what you want, the question now becomes, what is the fastest way to compute x-min(x) in R. And I'm pretty sure that the answer is just x-min(x).
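Putting the pieces together, a minimal sketch assuming x - min(x) is the intended weighting (note that it gives the minimum value zero weight, so use something like x - min(x) + 1 if every value must remain selectable):
set.seed(1)
N <- 1000
k <- 600
x <- runif(N, -5, 10)  # now includes negative values
w <- x - min(x)        # non-negative weights, affine in x
results <- sample(x, k, replace = TRUE, prob = w)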
Finally, for constants anywhere near what you have in your example, there is not much point in trying to optimize the calculation of weights, because the random sampling is going to take much longer anyway. For example:
> x<-rnorm(1000)
> k<-600
> p<-x-min(x)
> microbenchmark(x-min(x),sample(x,k,T,p))
Unit: microseconds
               expr   min      lq  median      uq    max neval
         x - min(x)  6.56  6.9105  7.0895  7.2515 13.629   100
 sample(x, k, T, p) 50.30 51.4360 51.7695 52.1970 66.196   100
