Data approximation with the maxLik library: NaNs produced (R)

I'm trying to fit a distribution to the stations column of the quakes dataset. I'm following an example in the maxLik package documentation: https://cran.r-project.org/web/packages/maxLik/maxLik.pdf
library(maxLik)
set.seed(1)
attach(quakes)
## log-likelihood function.
## Note: 'param' is a vector
llf <- function(param) {
  mu    <- param[1]
  sigma <- param[2]
  llValue <- dnorm(stations, mean = mu, sd = sigma, log = TRUE)
  return(sum(llValue))
}
## Estimate it. Take standard normal as start values
ml <- maxLik( llf, start = c(mu=0, sigma=1) )
I'm getting the warning In dnorm(stations, mean = mu, sd = sigma, log = TRUE) : NaNs produced. I suspect this happens because sum(llValue) returns negative values. As a result, the approximation is very bad. If I change the code to return(abs(sum(llValue))), the warning disappears and the approximation is a bit closer, but still very bad.
What am I doing wrong? How can I make an approximation for the quakes$stations data?

The problem was that llValue used the wrong distribution: I was trying to approximate lognormally distributed data with a normal density.
The stations column of the quakes dataset can be approximated with the following:
library(maxLik)
attach(quakes)
llf_dlnorm <- function(param) {
  mu    <- param[1]
  sigma <- param[2]
  # dlnorm is parameterized by meanlog and sdlog (the log-scale mean and sd)
  llValue <- dlnorm(stations, meanlog = mu, sdlog = sigma, log = TRUE)
  return(sum(llValue))
}
ml_dlnorm <- maxLik(llf_dlnorm, start = c(mu = mean(stations), sigma = sd(stations)))
x <- seq(0, 150, by = 0.01)
hist(stations, prob = TRUE)
lines(x, dlnorm(x, ml_dlnorm$estimate[1], ml_dlnorm$estimate[2]), col = "blue", lwd = 2)
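A side note on the original warning: the optimizer is free to try sigma <= 0 during the search, and dnorm/dlnorm return NaN for a negative sd, which is where "NaNs produced" comes from. A minimal guard (my sketch, not part of the original answer; the log-scale start values are my choice) is to return -Inf for invalid parameters so such steps are rejected:

llf_safe <- function(param) {
  mu    <- param[1]
  sigma <- param[2]
  if (sigma <= 0) return(-Inf)  # invalid sd: report an impossibly low likelihood
  sum(dlnorm(stations, meanlog = mu, sdlog = sigma, log = TRUE))
}
ml_safe <- maxLik(llf_safe, start = c(mu = mean(log(stations)), sigma = sd(log(stations))))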

Related

Nonlinear least squares algorithm code in R

I have a question on minimizing the sum of squared residuals to estimate theta in the regression function below. I intend not to use any built-in functions or packages in R, and to write the iterative algorithm myself.
The regression function is: y_k = exp(-theta * |x_k|) + e_k, for k = 1, ..., n
Here is my code, but it gives the following error for some sets of x and y. Thanks in advance for your suggestions!
Error in if (abs(dif) < 10^(-5)) break :
  missing value where TRUE/FALSE needed
Code:
theta <- -sum(log(abs(y)))/sum(abs(x))
# Algorithm:
rep <- 1
while (rep <= 1000) {
  Ratio <- sum((abs(x)*exp(-theta*abs(x)))*(y-exp(-theta*abs(x)))) /
    sum((abs(x)^2*exp(-theta*abs(x)))*(y-2*exp(-theta*abs(x))))
  if (is.na(Ratio)) {
    thetanew <- theta
  } else {
    thetanew <- theta + Ratio
  }
  dif <- thetanew - theta
  theta <- thetanew
  if (abs(dif) < 10^(-5)) break
  rep <- rep + 1
}
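For what it's worth, that error message means abs(dif) evaluated to NA or NaN, which if() cannot handle. One way this can happen (an assumed diagnosis, not confirmed against the poster's data) is the iterate diverging so that the update produces Inf - Inf:

dif <- Inf - Inf                 # NaN: exp(-theta*abs(x)) can overflow when theta diverges
# if (abs(dif) < 10^(-5)) break  # Error: missing value where TRUE/FALSE needed
# A defensive version of the convergence test:
if (is.finite(dif) && abs(dif) < 10^(-5)) message("converged")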

How to draw a log-likelihood graph

I am learning how to draw a log-likelihood graph. Please allow me to briefly introduce what I want to do:
Assume we have the data/vector as below:
set.seed(123)
sample <- rpois(50, 1.65)
And the log-likelihood function is given below:
log_like_graph <- function(lambda){
  X <- as.matrix(sample) # not sure whether this is necessary for a one-parameter distribution
  N <- nrow(X)
  logLik <- N*log(lambda) - lambda*N*mean(X)
  return(loglik)
}
log_like_graph <- Vectorize(log_like_graph)
# set range of lambda
lambda_vals <- seq(-10,10,by=1)
log_vals <- outer(lambda_vals,log_like_graph)
Based on the above lambda_vals and log_vals, I expect to produce a plot of the log-likelihood curve. However, when I execute the last command, log_vals <- outer(lambda_vals, log_like_graph), I get the error:
Error in as.vector(x, mode) :
cannot coerce type 'closure' to vector of type 'any'
Could you please help me solve this problem? Thank you very much!
(FYI: I mainly follow the YouTube video https://www.youtube.com/watch?v=w3drLH-DFpE&ab_channel=CalebLikesR, which teaches how to draw the curve of a log-likelihood function, although it uses the normal distribution for its demonstration.)
A couple of things I see:
- There is no need to vectorise log_like_graph; you can just pass lambda values into it with sapply rather than outer.
- You are passing lambda_vals < 0, but the support of lambda is >= 0.
- I don't think your log-likelihood function is correct. It should be -N * lambda - sum(lfactorial(sample)) + log(lambda) * sum(X), but it is easier (and more accurate) to use dpois(..., log=TRUE).
So, fixing these things:
# data
set.seed(123)
samples <- rpois(50, 1.65)
# The log-likelihood becomes
log_like_graph <- function(X, lambda){
  N <- NROW(X)
  logLik <- -N * lambda - sum(lfactorial(X)) + log(lambda) * sum(X)
  return(logLik)
}
# set lambda >= 0 and take smaller steps (0.01) for a smoother curve
lambda_vals <- seq(0,10,by=0.01)
# loop through lambda values, calculating the log-likelihood at each value
ll1 <- sapply(lambda_vals, function(i) log_like_graph(samples, i))
plot(lambda_vals, ll1, type="l")
This can also be done with dpois(..., log=TRUE) :
ll2 <- sapply(lambda_vals, function(i) sum(dpois(samples, lambda=i, log=TRUE)))
all.equal(ll1, ll2)
# [1] TRUE
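As a quick sanity check (my addition, not part of the original answer), the Poisson log-likelihood peaks at the sample mean, which can be marked on the plot drawn above:

abline(v = mean(samples), lty = 2)  # the Poisson MLE is lambda-hat = mean(X)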

Problems fitting a log-Pearson III with negative scale

I'd like to fit a log-Pearson III distribution to some data points I have in R. I followed the guidelines in this link, but I encounter a problem when my skewness (g) is negative (and of course the "scale" parameter is negative too, because of the sign(g) in the computation of scale). The distribution from "FAdist" does not work with a negative scale parameter, and I need it for the starting values of "fitdist" (from fitdistrplus). Some pages say that the shape and scale parameters of a Pearson III (or generalized gamma distribution) can only be positive, others don't, and I have run out of ideas. The code is:
library(fitdistrplus)
library(FAdist)
library(e1071)
# data
df <- c(92.8, 53.2, 112.0, 164.0, 132.0, 69.9, 140.0, 48.3, 123.0, 24.6, 179.0, 55.1, 31.3, 17.0, 111.0, 35.4, 133.0, 505.0, 303.0, 121.5, 203.0, 198.0, 250.0, 232.0, 185.0, 222.0, 191.0, 238.0, 53.0, 121.0, 106.4, 347.3, 186.4, 89.1, 131.4, 53.2, 252.6)
# log of df
df <- log(df)
# Pearson III sample moments
m <- mean(df)
v <- var(df)
s <- sd(df)
g <- e1071::skewness(df, type = 1)
n <- length(df)
# Pearson III parameter estimation
my.shape <- (2/g)^2
my.scale <- sqrt(v)/sqrt(my.shape)*sign(g) # modified as recommended by Carl Schwarz; this is negative =(
my.thres <- m - (my.shape*my.scale)
# All parameters together
my.param <- list(shape = my.shape, scale = my.scale, thres = my.thres)
# fitdist from "fitdistrplus" with "lgamma3" from "FAdist"
q.p3 <- fitdist(data = df, distr = "lgamma3", start = my.param)
This gives me the following:
Error in fitdist(data = df, distr = "lgamma3", start = my.param, :
the function mle failed to estimate the parameters,
with the error code 100
I have solved the problem. I used the package "PearsonDS" instead of "FAdist" to compute the log-Pearson III. Why? Because PearsonDS accepts negative values of the skewness (g) and, in consequence, negative values of the "scale" parameter (if you look at the PearsonDS documentation for Pearson III, it takes the absolute value of that argument). I had to make small changes to the Pearson III equations to work correctly with the package "fitdistrplus" (to fit the distribution to sample data).
The final code is below. IMPORTANT: the data must be the original values, NOT log(data), but the mean, variance and g must be computed from log(data):
library(fitdistrplus)
library(e1071)
library(PearsonDS)
load("data")
## PP for precipitation
pp <- as.numeric(data$pp)
## IMPORTANT: the data must be the original values, not log(data)
data <- pp
m <- mean(log(data))
v <- var(log(data))
s <- sd(log(data))
g <- e1071::skewness(log(data), type = 1)
n <- length(data)
#g <- g*(sqrt(n*(n-1))/(n-2))*(1+8.5/n)
my.shape <- (2/g)^2
my.scale <- (sqrt(v/my.shape))*sign(g) # modified as recommended by Carl Schwarz
my.thres <- m - my.shape*my.scale
# Density, CDF and quantile functions of the log-Pearson III:
# if Z = log(X) follows a Pearson III, then f_X(x) = f_Z(log x)/x
# (the /x is the Jacobian of the log transform)
dlPIII <- function(x, shape, location, scale) {
  PearsonDS::dpearsonIII(log(x), shape, location, scale, log = FALSE)/x
}
plPIII <- function(q, shape, location, scale) {
  PearsonDS::ppearsonIII(log(q), shape, location, scale, lower.tail = TRUE, log.p = FALSE)
}
qlPIII <- function(p, shape, location, scale) {
  exp(PearsonDS::qpearsonIII(p, shape, location, scale, lower.tail = TRUE, log.p = FALSE))
}
# Fit the distribution
f.pl3 <- fitdist(data, distr = "lPIII", method = "mle",
                 start = list(shape = my.shape, location = my.thres, scale = my.scale))
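Once the fit succeeds, the fitted quantile function can be used directly. For example (my sketch, reusing the fitted parameters stored in f.pl3$estimate), quantiles at a few non-exceedance probabilities, as is common in flood-frequency work with the log-Pearson III:

# Quantiles of the fitted log-Pearson III at selected probabilities
p <- c(0.5, 0.9, 0.99)
qlPIII(p,
       shape    = f.pl3$estimate["shape"],
       location = f.pl3$estimate["location"],
       scale    = f.pl3$estimate["scale"])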

How to run a Monte Carlo simulation from a custom distribution in R

I would like to draw 1000 samples from a custom distribution in R.
I have the following custom distribution:
library(gamlss)
mu <- 1
sigma <- 2
tau <- 3
kappa <- 3
rate <- 1
Rmax <- 20
x <- seq(1, 2e1, 0.01)
points <- Rmax * dexGAUS(x, mu = mu, sigma = sigma, nu = tau) * pgamma(x, shape = kappa, rate = rate)
plot(points ~ x)
How can I randomly sample via Monte Carlo simulation from this distribution?
My first attempt was the following code, which produced a histogram whose shape I did not expect.
hist(sample(points, 1000), breaks = 51)
This is not what I was looking for as it does not follow the same distribution as the pdf.
If you want a Monte Carlo simulation, you'll need to sample from the distribution a large number of times, not take a large sample one time.
Your object, points, has values that increase with the index up to a peak around index 400, level off, and then decrease. That's what plot(points ~ x) shows. It may describe a distribution, but the actual distribution of the values stored in points is different: a histogram shows how often values fall within each range. You'll notice your x axis for the histogram is similar to the y axis of the plot(points ~ x) plot. The actual distribution of values in the points object is easy enough to see, and it is similar to what you're seeing when sampling 1000 values at random, without replacement, from an object with 1900 values in it. Here's the distribution of values in points (no simulation required):
hist(points, 100)
I used 100 breaks on purpose so you could see some of the fine details.
Notice the little bump in the tail at the top, which you may not be expecting if you want the histogram to look like the plot of the values vs. the index (or some increasing x). It means that there are more values in points around 2 than around 1. See how the curve of plot(points ~ x) flattens where the value is around 2, and how steep it is between 0.5 and 1.5. Notice also the large hump at the low end of the histogram, and look at the plot(points ~ x) curve again: most of the values (whether they're at the low end or the high end of that curve) are close to 0, or at least less than 0.25. If you look at those details, you may be able to convince yourself that the histogram is, in fact, exactly what you should expect :)
If you want a Monte Carlo simulation of a sample from this object, you might try something like:
samples <- replicate(1000, sample(points, 100, replace = TRUE))
If you want to generate data using points as a probability density function, that question has been asked and answered here
Let's define your (not normalized) probability density function as a function:
library(gamlss)
fun <- function(x, mu = 1, sigma = 2, tau = 3, kappa = 3, rate = 1, Rmax = 20) {
  Rmax * dexGAUS(x, mu = mu, sigma = sigma, nu = tau) *
    pgamma(x, shape = kappa, rate = rate)
}
Now one approach is to use some MCMC (Markov chain Monte Carlo) method. For instance,
simMCMC <- function(N, init, fun, ...) {
  out <- numeric(N)
  out[1] <- init
  for (i in 2:N) {
    # random-walk proposal around the current value
    pr <- out[i - 1] + rnorm(1, ...)
    # Metropolis acceptance ratio; the unknown normalizing constant cancels
    r <- fun(pr) / fun(out[i - 1])
    # accept with probability min(r, 1), otherwise keep the current value
    out[i] <- ifelse(runif(1) < r, pr, out[i - 1])
  }
  out
}
It starts from the point init and gives N draws. The approach can be improved in many ways, but I'm simply going to start from init = 5, include a burn-in period of 20000, and select every second draw to reduce the dependence between successive draws:
d <- tail(simMCMC(20000 + 2000, init = 5, fun = fun), 2000)[c(TRUE, FALSE)]
plot(density(d))
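A simple diagnostic (my addition, not part of the original answer) is the acceptance rate of the chain: with a random-walk Metropolis sampler, consecutive equal values correspond to rejected proposals.

# Fraction of accepted proposals; rates very near 0 or 1 suggest the
# proposal sd should be tuned
draws <- simMCMC(22000, init = 5, fun = fun)
mean(diff(draws) != 0)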
You can invert the ECDF of the distribution:
ecd.points <- ecdf(points)
invecdfpts <- with(environment(ecd.points), approxfun(y, x))
samp.inv.ecd <- function(n = 100) invecdfpts(runif(n))
plot(density(samp.inv.ecd(100)))
plot(density(points))
png(); layout(matrix(1:2, 1))
plot(density(samp.inv.ecd(100)), main = "The Sample")
plot(density(points), main = "The Original"); dev.off()
Here's another way to do it, drawing on "R: Generate data from a probability density distribution" and "How to create a distribution function in R?":
x <- seq(1, 2e1, 0.01)
points <- 20*dexGAUS(x, mu = 1, sigma = 2, nu = 3)*pgamma(x, shape = 3, rate = 1)
f <- function(x) 20*dexGAUS(x, mu = 1, sigma = 2, nu = 3)*pgamma(x, shape = 3, rate = 1)
C <- integrate(f, -Inf, Inf)
C$value
# [1] 11.50361
# normalize by C$value
f <- function(x)
  20*dexGAUS(x, mu = 1, sigma = 2, nu = 3)*pgamma(x, shape = 3, rate = 1)/11.50361
# inverse-CDF sampling on the grid; 'points' holds the (unnormalized) density
# values and 'x' the grid, so cumsum(points)/sum(points) approximates the CDF
random.points <- approx(cumsum(points)/sum(points), x, runif(10000))$y
hist(random.points, 1000)
hist(random.points*40, 1000) # will get the scaling like your original function
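To check the draws against the target (my sketch, not from the original answer), overlay the normalized density on the histogram; agreement is only approximate because the grid covers x in [1, 20]:

# Compare the inverse-CDF draws with the normalized target density
hist(random.points, breaks = 100, freq = FALSE, main = "Inverse-CDF draws")
lines(x, f(x), col = "blue", lwd = 2)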

Generating samples from a two-Gaussian mixture in r (code given in MATLAB)

I'm trying to create (in R) the equivalent of the following MATLAB function, which generates n samples from a mixture of N(m1, (s1)^2) and N(m2, (s2)^2), with a fraction alpha coming from the first Gaussian.
I have a start, but the results are notably different between MATLAB and R (i.e., the MATLAB results give occasional values of +-8, but the R version never even gives a value of +-5). Please help me sort out what is wrong here. Thanks :-)
For Example:
Plot 1000 samples from a mix of N(0,1) and N(0,36) with 95% of samples from the first Gaussian. Normalize the samples to mean zero and standard deviation one.
MATLAB
function
function y = gaussmix(n,m1,m2,s1,s2,alpha)
y = zeros(n,1);
U = rand(n,1);
I = (U < alpha)
y = I.*(randn(n,1)*s1+m1) + (1-I).*(randn(n,1)*s2 + m2);
implementation
P = gaussmix(1000,0,0,1,6,.95)
P = (P-mean(P))/std(P)
plot(P)
axis([0 1000 -15 15])
hist(P)
axis([-15 15 0 1000])
resulting plot
resulting hist
R
yn <- rbinom(1000, 1, .95)
s <- rnorm(1000, 0 + 0*yn, 1 + 36*yn)
sn <- (s-mean(s))/sd(s)
plot(sn, xlim=range(0,1000), ylim=range(-15,15))
hist(sn, xlim=range(-15,15), ylim=range(0,1000))
resulting plot
resulting hist
As always, THANK YOU!
SOLUTION
gaussmix <- function(nsim, mean_1, mean_2, std_1, std_2, alpha) {
  U <- runif(nsim)
  I <- as.numeric(U < alpha)
  y <- I*rnorm(nsim, mean = mean_1, sd = std_1) +
    (1-I)*rnorm(nsim, mean = mean_2, sd = std_2)
  return(y)
}
z1 <- gaussmix(1000, 0, 0, 1, 6, 0.95)
z1_standardized <- (z1-mean(z1))/sqrt(var(z1))
z2 <- gaussmix(1000, 0, 3, 1, 1, 0.80)
z2_standardized <- (z2-mean(z2))/sqrt(var(z2))
z3 <- rlnorm(1000)
z3_standardized <- (z3-mean(z3))/sqrt(var(z3))
par(mfrow = c(2, 3))
hist(z1_standardized, xlim = c(-10, 10), ylim = c(0, 500),
     main = "Histogram of 95% of N(0,1) and 5% of N(0,36)",
     col = "blue", xlab = " ")
hist(z2_standardized, xlim = c(-10, 10), ylim = c(0, 500),
     main = "Histogram of 80% of N(0,1) and 20% of N(3,1)",
     col = "blue", xlab = " ")
hist(z3_standardized, xlim = c(-10, 10), ylim = c(0, 500),
     main = "Histogram of samples of LN(0,1)", col = "blue", xlab = " ")
##
plot(z1_standardized, type = "l",
     main = "1000 samples from a mixture N(0,1) and N(0,36)",
     col = "blue", xlab = "Samples", ylab = "Mean", ylim = c(-10, 10))
plot(z2_standardized, type = "l",
     main = "1000 samples from a mixture N(0,1) and N(3,1)",
     col = "blue", xlab = "Samples", ylab = "Mean", ylim = c(-10, 10))
plot(z3_standardized, type = "l",
     main = "1000 samples from LN(0,1)",
     col = "blue", xlab = "Samples", ylab = "Mean", ylim = c(-10, 10))
There are two problems, I think: (1) your R code is creating a mixture of normal distributions with standard deviations of 1 and 37 (sd = 1 + 36*yn, not 6); (2) by setting prob equal to alpha in your rbinom() call, you're getting a fraction alpha in the second mode rather than the first. So what you are getting is a distribution that is mostly a Gaussian with sd 37, contaminated by a 5% mixture of a Gaussian with sd 1, rather than a Gaussian with sd 1 contaminated by a 5% mixture of a Gaussian with sd 6. Scaling by the standard deviation of the mixture (which is about 36.6) basically reduces it to a standard Gaussian with a slight bump near the origin ...
(The other answers posted here do solve your problem perfectly well, but I thought you might be interested in a diagnosis ...)
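The diagnosis is easy to verify numerically (my sketch, not part of the original answer):

# The question's code draws 95% of values with sd 37 (1 + 36*1) and 5% with sd 1
set.seed(1)
yn <- rbinom(1e5, 1, 0.95)
s <- rnorm(1e5, 0, 1 + 36*yn)
sd(s)  # approximately 36: the sd-37 component dominates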
A more compact (and perhaps more idiomatic) version of your MATLAB gaussmix function (I think runif(n) < alpha is slightly more efficient than rbinom(n, size = 1, prob = alpha)):
gaussmix <- function(n, m1, m2, s1, s2, alpha) {
  I <- runif(n) < alpha
  rnorm(n, mean = ifelse(I, m1, m2), sd = ifelse(I, s1, s2))
}
set.seed(1001)
s <- gaussmix(1000,0,0,1,6,0.95)
Not that you asked for it, but the mclust package offers a way to generalize your problem to more dimensions and diverse covariance structures. See ?mclust::sim. The example task would be done this way:
require(mclust)
simdata <- sim(modelName = "V",
               parameters = list(pro = c(0.95, 0.05),
                                 mean = c(0, 0),
                                 variance = list(modelName = "V",
                                                 d = 1,
                                                 G = 2,
                                                 sigmasq = c(1, 36))),
               n = 1000)
# column 1 of the result holds the component labels, column 2 the draws
plot(scale(simdata[, 2]), type = "h")
I recently wrote the density and sampling functions for a finite mixture of normal distributions:
dmultiNorm <- function(x, means, sds, weights) {
  if (length(means) != length(sds))
    stop("Length of means must be equal to length of standard deviations")
  N <- length(x)
  n <- length(means)
  if (missing(weights)) weights <- rep(1, n)
  if (length(weights) != n)
    stop("Length of weights not equal to length of means and sds")
  weights <- weights/sum(weights)
  dens <- numeric(N)
  for (i in 1:n) {
    # weighted sum of the component densities
    dens <- dens + weights[i] * dnorm(x, means[i], sds[i])
  }
  return(dens)
}
rmultiNorm <- function(N, means, sds, weights) {
  if (length(means) != length(sds))
    stop("Length of means must be equal to length of standard deviations")
  n <- length(means)
  if (missing(weights)) weights <- rep(1, n)
  if (length(weights) != n)
    stop("Length of weights not equal to length of means and sds")
  Res <- numeric(N)
  for (i in 1:N) {
    # pick a component with probability proportional to its weight,
    # then draw from that component
    s <- sample(1:n, 1, prob = weights)
    Res[i] <- rnorm(1, means[s], sds[s])
  }
  return(Res)
}
Here means is a vector of means, sds a vector of standard deviations, and weights a vector of probabilities (proportional) of sampling from each of the distributions. Is this useful to you?
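For example (a usage sketch for the question's case, my addition), the 95/5 mixture can be drawn and standardized like this:

# Draw 1000 values from 95% N(0,1) + 5% N(0,36) using rmultiNorm above
set.seed(1)
z <- rmultiNorm(1000, means = c(0, 0), sds = c(1, 6), weights = c(0.95, 0.05))
hist(scale(z), breaks = 50)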
Here is code to do this task:
"For Example: Plot 1000 samples from a mix of N(0,1) and N(0,36) with 95% of samples from the first Gaussian. Normalize the samples to mean zero and standard deviation one."
plot(multG <- c(rnorm(950), rnorm(50, 0, 6))[sample(1000)], type = "h") # N(0,36) has variance 36, i.e. sd 6
scmulG <- scale(multG)
summary(scmulG) # scale() centers to mean 0 and rescales to sd 1
