Fit distribution to given frequency values in R - r

I have frequency values changing with the time (x axis units), as presented on the picture below. After some normalization these values may be seen as data points of a density function for some distribution.
Q: Assuming that these frequency points are from Weibull distribution T, how can I fit best Weibull density function to the points so as to infer the distribution T parameters from it?
sample <- c(7787,3056,2359,1759,1819,1189,1077,1080,985,622,648,518,
611,1037,727,489,432,371,1125,69,595,624)
plot(1:length(sample), sample, type = "l")
points(1:length(sample), sample)
Update.
To prevent from being misunderstood, I would like to add little more explanation. By saying I have frequency values changing with the time (x axis units) I mean I have data which says that I have:
7787 realizations of value 1
3056 realizations of value 2
2359 realizations of value 3 ... etc.
Some way towards my goal (incorrect one, as I think) would be to create a set of these realizations:
# Loop to simulate values
set.values <- c()
for(i in 1:length(sample)){
set.values <<- c(set.values, rep(i, times = sample[i]))
}
hist(set.values)
lines(1:length(sample), sample)
points(1:length(sample), sample)
and use fitdistr on the set.values:
f2 <- fitdistr(set.values, 'weibull')
f2
Why I think it is incorrect way and why I am looking for a better solution in R?
in the distribution fitting approach presented above it is assumed that set.values is a complete set of my realisations from the distribution T
in my original question I know the points from the first part of the density curve - I do not know its tail and I want to estimate the tail (and the whole density function)

Here is a better attempt, like before it uses optim to find the best value constrained to a set of values in a box (defined by the lower and upper vectors in the optim call). Notice it scales x and y as part of the optimization in addition to the Weibull distribution shape parameter, so we have 3 parameters to optimize over.
Unfortunately when using all the points it pretty much always finds something on the edges of the constraining box which indicates to me that maybe Weibull is maybe not a good fit for all of the data. The problem is the two points - they ares just too large. You see the attempted fit to all data in the first plot.
If I drop those first two points and just fit the rest, we get a much better fit. You see this in the second plot. I think this is a good fit, it is in any case a local minimum in the interior of the constraining box.
library(optimx)
sample <- c(60953,7787,3056,2359,1759,1819,1189,1077,1080,985,622,648,518,
611,1037,727,489,432,371,1125,69,595,624)
t.sample <- 0:22
s.fit <- sample[3:23]
t.fit <- t.sample[3:23]
wx <- function(param) {
res <- param[2]*dweibull(t.fit*param[3],shape=param[1])
return(res)
}
minwx <- function(param){
v <- s.fit-wx(param)
sqrt(sum(v*v))
}
p0 <- c(1,200,1/20)
paramopt <- optim(p0,minwx,gr=NULL,lower=c(0.1,100,0.01),upper=c(1.1,5000,1))
popt <- paramopt$par
popt
rms <- paramopt$value
tit <- sprintf("Weibull - Shape:%.3f xscale:%.1f yscale:%.5f rms:%.1f",popt[1],popt[2],popt[3],rms)
plot(t.sample[2:23], sample[2:23], type = "p",col="darkred")
lines(t.fit, wx(popt),col="blue")
title(main=tit)

You can directly calculate the maximum likelihood parameters, as described here.
# Defining the error of the implicit function
k.diff <- function(k, vec){
x2 <- seq(length(vec))
abs(k^-1+weighted.mean(log(x2), w = sample)-weighted.mean(log(x2),
w = x2^k*sample))
}
# Setting the error to "quite zero", fulfilling the equation
k <- optimize(k.diff, vec=sample, interval=c(0.1,5), tol=10^-7)$min
# Calculate lambda, given k
l <- weighted.mean(seq(length(sample))^k, w = sample)
# Plot
plot(density(rep(seq(length(sample)),sample)))
x <- 1:25
lines(x, dweibull(x, shape=k, scale= l))

Assuming the data are from a Weibull distribution, you can get an estimate of the shape and scale parameter like this:
sample <- c(7787,3056,2359,1759,1819,1189,1077,1080,985,622,648,518,
611,1037,727,489,432,371,1125,69,595,624)
f<-fitdistr(sample, 'weibull')
f
If you are not sure whether it is distributed Weibull, I would recommend using the ks.test. This tests whether your data is from a hypothesised distribution. Given your knowledge of the nature of the data, you could test for a few selected distributions and see which one works best.
For your example this would look like this:
ks = ks.test(sample, "pweibull", shape=f$estimate[1], scale=f$estimate[2])
ks
The p-value is insignificant, hence you do not reject the hypothesis that the data is from a Weibull distribution.
Update: The histograms of either the Weibull or exponential look like a good match to your data. I think the exponential distribution gives you a better fit. Pareto distribution is another option.
f<-fitdistr(sample, 'weibull')
z<-rweibull(10000, shape= f$estimate[1],scale= f$estimate[2])
hist(z)
f<-fitdistr(sample, 'exponential')
z = rexp(10000, f$estimate[1])
hist(z)

Related

confidence interval around predicted value from complex inverse function

I'm trying to get a 95% confidence interval around some predicted values, but am not capable of achieving this.
Basically, I estimated a growth curve like this:
set.seed(123)
dat=data.frame(size=rnorm(50,10,3),age=rnorm(50,5,2))
S <- function(t,ts,C,K) ((C*K)/(2*pi))*sin(2*pi*(t-ts))
sommers <- function(t,Linf,K,t0,ts,C)
Linf*(1-exp(-K*(t-t0)-S(t,ts,C,K)+S(t0,ts,C,K)))
model <- nls(size~sommers(age,Linf,K,t0,ts,C),data=dat,
start=list(Linf=10,K=4.7,t0=2.2,C=0.9,ts=0.1))
I have independent size measurements, for which I would like to predict the age. Therefore, the inverse of the function, which is not very straightforward, I calculated like this:
model.out=coef(model)
S.out <- function(t)
((model.out[[4]]*model.out[[2]])/(2*pi))*sin(2*pi*(t-model.out[[5]]))
sommers.out <- function(t)
model.out[[1]]*(1-exp(-model.out[[2]]*(t-model.out[[3]])-S.out(t)+S.out(model.out[[3]])))
inverse = function (f, lower = -100, upper = 100) {
function (y) uniroot((function (x) f(x) - y), lower = lower, upper = upper)[1]
}
sommers.inverse = inverse(sommers.out, 0, 25)
x= sommers.inverse(10) #this works with my complete dataset, but not with this fake one
Although this works fine, I need to know the confidence interval (95%) around this estimate (x). For linear models there is for example "predict(... confidence=)". I could also bootstrap the function somehow to get the quantiles associated with the parameters (didn't find how), to then use the extremes of those to calculate the maximum and minimum values predictable. But that doesn't really look like the good way of doing this....
Any help would be greatly appreciated.
EDIT after answer:
So this worked (explained in the book of Ben Bolker, see answer):
vmat = mvrnorm(1000, mu = coef(mfit), Sigma = vcov(mfit))
dist = numeric(1000)
for (i in 1:1000) {dist[i] = sommers_inverse(9.938,vmat[i,])}
quantile(dist, c(0.025, 0.975))
On the rather bad fake data I gave, this works of course rather horrible. But on the real data (which I have a problem recreating), this is ok!
Unless I'm mistaken, you're going to have to use either regular (parametric) bootstrapping or a method called either "population predictive intervals" (e.g., see section 5 of chapter 7 of Bolker 2008), which assumes that the sampling distributions of your parameters are multivariate Normal. However, I think you may have bigger problems, unless I've somehow messed up your model in adapting it ...
Generate data (note that random data may actually bad for testing your model - see below ...)
set.seed(123)
dat <- data.frame(size=rnorm(50,10,3),age=rnorm(50,5,2))
S <- function(t,ts,C,K) ((C*K)/(2*pi))*sin(2*pi*(t-ts))
sommers <- function(t,Linf,K,t0,ts,C)
Linf*(1-exp(-K*(t-t0)-S(t,ts,C,K)+S(t0,ts,C,K)))
Plot the data and the initial curve estimate:
plot(size~age,data=dat,ylim=c(0,16))
agevec <- seq(0,10,length=1001)
lines(agevec,sommers(agevec,Linf=10,K=4.7,t0=2.2,ts=0.1,C=0.9))
I had trouble with nls so I used minpack.lm::nls.lm, which is slightly more robust. (There are other options here, e.g. calculating the derivatives and providing the gradient function, or using AD Model Builder or Template Model Builder, or using the nls2 package.)
For nls.lm we need a function that returns the residuals:
sommers_fn <- function(par,dat) {
with(c(as.list(par),dat),size-sommers(age,Linf,K,t0,ts,C))
}
library(minpack.lm)
mfit <- nls.lm(fn=sommers_fn,
par=list(Linf=10,K=4.7,t0=2.2,C=0.9,ts=0.1),
dat=dat)
coef(mfit)
## Linf K t0 C ts
## 10.6540185 0.3466328 2.1675244 136.7164179 0.3627371
Here's our problem:
plot(size~age,data=dat,ylim=c(0,16))
lines(agevec,sommers(agevec,Linf=10,K=4.7,t0=2.2,ts=0.1,C=0.9))
with(as.list(coef(mfit)), {
lines(agevec,sommers(agevec,Linf,K,t0,ts,C),col=2)
abline(v=t0,lty=2)
abline(h=c(0,Linf),lty=2)
})
With this kind of fit, the results of the inverse function are going to be extremely unstable, as the inverse function is many-to-one, with the number of inverse values depending sensitively on the parameter values ...
sommers_pred <- function(x,pars) {
with(as.list(pars),sommers(x,Linf,K,t0,ts,C))
}
sommers_pred(6,coef(mfit)) ## s(6)=9.93
sommers_inverse <- function (y, pars, lower = -100, upper = 100) {
uniroot(function(x) sommers_pred(x,pars) -y, c(lower, upper))$root
}
sommers_inverse(9.938, coef(mfit)) ## 0.28
If I pick my interval very carefully I can get back the correct answer ...
sommers_inverse(9.938, coef(mfit), 5.5, 6.2)
Maybe your model will be better behaved with more realistic data. I hope so ...

Errors running Maximum Likelihood Estimation on a three parameter Weibull cdf

I am working with the cumulative emergence of flies over time (taken at irregular intervals) over many summers (though first I am just trying to make one year work). The cumulative emergence follows a sigmoid pattern and I want to create a maximum likelihood estimation of a 3-parameter Weibull cumulative distribution function. The three-parameter models I've been trying to use in the fitdistrplus package keep giving me an error. I think this must have something to do with how my data is structured, but I cannot figure it out. Obviously I want it to read each point as an x (degree days) and a y (emergence) value, but it seems to be unable to read two columns. The main error I'm getting says "Non-numeric argument to mathematical function" or (with slightly different code) "data must be a numeric vector of length greater than 1". Below is my code including added columns in the df_dd_em dataframe for cumulative emergence and percent emergence in case that is useful.
degree_days <- c(998.08,1039.66,1111.29,1165.89,1236.53,1293.71,
1347.66,1387.76,1445.47,1493.44,1553.23,1601.97,
1670.28,1737.29,1791.94,1849.20,1920.91,1967.25,
2036.64,2091.85,2152.89,2199.13,2199.13,2263.09,
2297.94,2352.39,2384.03,2442.44,2541.28,2663.90,
2707.36,2773.82,2816.39,2863.94)
emergence <- c(0,0,0,1,1,0,2,3,17,10,0,0,0,2,0,3,0,0,1,5,0,0,0,0,
0,0,0,0,1,0,0,0,0,0)
cum_em <- cumsum(emergence)
df_dd_em <- data.frame (degree_days, emergence, cum_em)
df_dd_em$percent <- ave(df_dd_em$emergence, FUN = function(df_dd_em) 100*(df_dd_em)/46)
df_dd_em$cum_per <- ave(df_dd_em$cum_em, FUN = function(df_dd_em) 100*(df_dd_em)/46)
x <- pweibull(df_dd_em[c(1,3)],shape=5)
dframe2.mle <- fitdist(x, "weibull",method='mle')
Here's my best guess at what you're after:
Set up data:
dd <- data.frame(degree_days=c(998.08,1039.66,1111.29,1165.89,1236.53,1293.71,
1347.66,1387.76,1445.47,1493.44,1553.23,1601.97,
1670.28,1737.29,1791.94,1849.20,1920.91,1967.25,
2036.64,2091.85,2152.89,2199.13,2199.13,2263.09,
2297.94,2352.39,2384.03,2442.44,2541.28,2663.90,
2707.36,2773.82,2816.39,2863.94),
emergence=c(0,0,0,1,1,0,2,3,17,10,0,0,0,2,0,3,0,0,1,5,0,0,0,0,
0,0,0,0,1,0,0,0,0,0))
dd <- transform(dd,cum_em=cumsum(emergence))
We're actually going to fit to an "interval-censored" distribution (i.e. probability of emergence between successive degree day observations: this version assumes that the first observation refers to observations before the first degree-day observation, you could change it to refer to observations after the last observation).
library(bbmle)
## y*log(p) allowing for 0/0 occurrences:
y_log_p <- function(y,p) ifelse(y==0 & p==0,0,y*log(p))
NLLfun <- function(scale,shape,x=dd$degree_days,y=dd$emergence) {
prob <- pmax(diff(pweibull(c(-Inf,x), ## or (c(x,Inf))
shape=shape,scale=scale)),1e-6)
## multinomial probability
-sum(y_log_p(y,prob))
}
library(bbmle)
I should probably have used something more systematic like the method of moments (i.e. matching the mean and variance of a Weibull distribution with the mean and variance of the data), but I just hacked around a bit to find plausible starting values:
## preliminary look (method of moments would be better)
scvec <- 10^(seq(0,4,length=101))
plot(scvec,sapply(scvec,NLLfun,shape=1))
It's important to use parscale to let R know that the parameters are on very different scales:
startvals <- list(scale=1000,shape=1)
m1 <- mle2(NLLfun,start=startvals,
control=list(parscale=unlist(startvals)))
Now try with a three-parameter Weibull (as originally requested) -- requires only a slight modification of what we already have:
library(FAdist)
NLLfun2 <- function(scale,shape,thres,
x=dd$degree_days,y=dd$emergence) {
prob <- pmax(diff(pweibull3(c(-Inf,x),shape=shape,scale=scale,thres)),
1e-6)
## multinomial probability
-sum(y_log_p(y,prob))
}
startvals2 <- list(scale=1000,shape=1,thres=100)
m2 <- mle2(NLLfun2,start=startvals2,
control=list(parscale=unlist(startvals2)))
Looks like the three-parameter fit is much better:
library(emdbook)
AICtab(m1,m2)
## dAIC df
## m2 0.0 3
## m1 21.7 2
And here's the graphical summary:
with(dd,plot(cum_em~degree_days,cex=3))
with(as.list(coef(m1)),curve(sum(dd$emergence)*
pweibull(x,shape=shape,scale=scale),col=2,
add=TRUE))
with(as.list(coef(m2)),curve(sum(dd$emergence)*
pweibull3(x,shape=shape,
scale=scale,thres=thres),col=4,
add=TRUE))
(could also do this more elegantly with ggplot2 ...)
These don't seem like spectacularly good fits, but they're sane. (You could in principle do a chi-squared goodness-of-fit test based on the expected number of emergences per interval, and accounting for the fact that you've fitted a three-parameter model, although the values might be a bit low ...)
Confidence intervals on the fit are a bit of a nuisance; your choices are (1) bootstrapping; (2) parametric bootstrapping (resample parameters assuming a multivariate normal distribution of the data); (3) delta method.
Using bbmle::mle2 makes it easy to do things like get profile confidence intervals:
confint(m1)
## 2.5 % 97.5 %
## scale 1576.685652 1777.437283
## shape 4.223867 6.318481
dd <- data.frame(degree_days=c(998.08,1039.66,1111.29,1165.89,1236.53,1293.71,
1347.66,1387.76,1445.47,1493.44,1553.23,1601.97,
1670.28,1737.29,1791.94,1849.20,1920.91,1967.25,
2036.64,2091.85,2152.89,2199.13,2199.13,2263.09,
2297.94,2352.39,2384.03,2442.44,2541.28,2663.90,
2707.36,2773.82,2816.39,2863.94),
emergence=c(0,0,0,1,1,0,2,3,17,10,0,0,0,2,0,3,0,0,1,5,0,0,0,0,
0,0,0,0,1,0,0,0,0,0))
dd$cum_em <- cumsum(dd$emergence)
dd$percent <- ave(dd$emergence, FUN = function(dd) 100*(dd)/46)
dd$cum_per <- ave(dd$cum_em, FUN = function(dd) 100*(dd)/46)
dd <- transform(dd)
#start 3 parameter model
library(FAdist)
## y*log(p) allowing for 0/0 occurrences:
y_log_p <- function(y,p) ifelse(y==0 & p==0,0,y*log(p))
NLLfun2 <- function(scale,shape,thres,
x=dd$degree_days,y=dd$percent) {
prob <- pmax(diff(pweibull3(c(-Inf,x),shape=shape,scale=scale,thres)),
1e-6)
## multinomial probability
-sum(y_log_p(y,prob))
}
startvals2 <- list(scale=1000,shape=1,thres=100)
m2 <- mle2(NLLfun2,start=startvals2,
control=list(parscale=unlist(startvals2)))
summary(m2)
#graphical summary
windows(5,5)
with(dd,plot(cum_per~degree_days,cex=3))
with(as.list(coef(m2)),curve(sum(dd$percent)*
pweibull3(x,shape=shape,
scale=scale,thres=thres),col=4,
add=TRUE))

Finding distribution of sample mean by central limit theorem

let X1,...,X25 be a random sampe from normal distribution with mean=37 and sd=45.
Let xbar be the sample mean. How is xbar distributed? I have to verify it by central limit theorem.
Also compute P(xbar>43.1)
my attempt
for(i in 1:1000){
x=rnorm(25,mean=37,sd=45)
xbar=mean(x)
z=(xbar-37)/(45/sqrt(25))
}
z
But i couldn't find the distribution of xbar.
Change your for loop and use replicate instead
set.seed(1)
X <- replicate(1000, rnorm(25,mean=37,sd=45))
X_bar <- colMeans(X)
hist(X_bar) # this is how the distribution of X_bar looks like
xbar=c()
for(i in 1:1000){
x=rnorm(25,mean=37,sd=45)
xbar=c(xbar,mean(x)) #save every time the value of xbar
}
hist(xbar) #plot the hist of xbar
#compute the probability to b e bigger thant 43.1
prob=which(xbar>43.1)/length(xbar)
Just to expand in this a little bit.
The Central Limit Theorem states the distribution of the mean is asymptotically N[mu, sd/sqrt(n)]. Where mu and sd are the mean and standard deviation of the underlying distribution, and n is the sample size used in calculating the mean. So, in the example below data is a dataset of size 2500 drawn from N[37,45], arbitrarily segmented into 100 groups of 25. means is a dataset of the means of each group. Note that both the data and the means are (aprox.) normally distributed, but the distribution of the means is much tighter (lower sigma). From the CLT we expect sd(mean) ~ sd(data)/sqrt(25), which it is.
data <- data.frame(sample=rep(1:100,each=25),x = rnorm(2500,mean=37,sd=45))
means <- aggregate(data$x,by=list(data$sample),mean)
#plot histoggrams
par(mfrow=c(1,2))
hist(data$x,main="",sub="Histogram of Underlying Data",xlim=c(-150,200))
hist(means$x,main="",sub="Histogram of Means", xlim=c(-150,200))
mtext("Underlying Data ~ N[37,45]",outer=T,line=-3)
c(sd.data=sd(data$x), sd.means=sd(means$x))
sd.data sd.means
43.548570 7.184518
But the real power of the CLT is that it shows that the distribution of the means is asymptotically normal, regardless of the distribution of the underlying data. This is shown here, where the underlying data is sampled from a uniform distribution. Again, sd(mean) ~ sd(data)/sqrt(25).
data <- data.frame(sample=rep(1:100,each=25),x = runif(2500,min=-150, max=200))
means <- aggregate(data$x,by=list(data$sample),mean)
#plot histoggrams
par(mfrow=c(1,2))
hist(data$x,main="",sub="Histogram of Underlying Data",xlim=c(-150,200))
hist(means$x,main="",sub="Histogram of Means", xlim=c(-150,200))
mtext("Underlying Data ~ U[-150,200]",outer=T,line=-3)
c(sd.data=sd(data$x), sd.means=sd(means$x))
sd.data sd.means
99.7800 18.8176

Observation in a bivariate Ellipse

I am trying find the probability that a point lies within an ellipse?
For eg if I was plotting the bivariate data (x,y) for 300 datasets in an 95% ellipsoid region, how do I calculate how many times out of 300 will my points fall inside the
ellipse?
Heres the code I am using
library(MASS)
seed<-1234
x<-NULL
k<-1
Sigma2 <- matrix(c(.72,.57,.57,.46),2,2)
Sigma2
rho <- Sigma2[1,2]/sqrt(Sigma2[1,1]*Sigma2[2,2])
rho
eta1<-replicate(300,mvrnorm(k, mu=c(-1.59,-2.44), Sigma2))
library(car)
dataEllipse(eta1[1,],eta1[2,], levels=c(0.05, 0.95))
Thanks for your help.
I don't see why people are jumping on the OP. In context, it's clearly a programming question: it's about getting the empirical frequency of data points within a given ellipse, not a theoretical probability. The OP even posted code and a graph showing what they're trying to obtain.
It may be that they don't fully understand the statistical theory behind a 95% ellipse, but they didn't ask about that. Besides, making plots and calculating frequencies like this is an excellent way of coming to grips with the theory.
Anyway, here's some code that answers the narrowly-defined question of how to count the points within an ellipse obtained via a normal distribution (which is what underlies dataEllipse). The idea is to transform your data to the unit circle via principal components, then get the points within a certain radius of the origin.
within.ellipse <- function(x, y, plot.ellipse=TRUE)
{
if(missing(y) && is.matrix(x) && ncol(x) == 2)
{
y <- x[,2]
x <- x[,1]
}
if(plot.ellipse)
dataEllipse(x, y, levels=0.95)
d <- scale(prcomp(cbind(x, y), scale.=TRUE)$x)
rad <- sqrt(2 * qf(.95, 2, nrow(d) - 1))
mean(sqrt(d[,1]^2 + d[,2]^2) < rad)
}
It was also commented that a 95% data ellipse contains 95% of the data by definition. This is certainly not true, at least for normal-theory ellipses. If your distribution is particularly bad, the coverage frequency may not even converge to the assumed level as the sample size increases. Consider a generalised pareto distribution, for example:
library(evd) # for rgpd
# generalised pareto has no variance for shape > 0.5
z <- sapply(1:1000, function(...) within.ellipse(rgpd(100, shape=5), rgpd(100, shape=5), FALSE))
mean(z)
[[1] 0.97451
z <- sapply(1:1000, function(...) within.ellipse(rgpd(10000, shape=5), rgpd(10000, shape=5), FALSE))
mean(z)
[1] 0.9995808

Calculate the derivative of a data-function in r

Is there an easy way to calculate the derivative of non-liner functions that are give by data?
for example:
x = 1 / c(1000:1)
y = x^-1.5
ycs = cumsum(y)
plot (x, ycs, log="xy")
How can I calculate the derivative function from the function given by ´x´ and ´ycs´?
Was also going to suggest an example of a smoothed spline fit followed by prediction of the derivative. In this case, the results are very similar to the diff calculation described by #dbaupp:
spl <- smooth.spline(x, y=ycs)
pred <- predict(spl)
plot (x, ycs, log="xy")
lines(pred, col=2)
ycs.prime <- diff(ycs)/diff(x)
pred.prime <- predict(spl, deriv=1)
plot(ycs.prime)
lines(pred.prime$y, col=2)
Generating derivatives from raw data is risky unless you are very careful. Not for nothing is this process known as "error multiplier." Unless you know the noise content of your data and take some action (e.g. spline) to remove the noise prior to differentiation, you may well end up with a scary curve indeed.
The derivative of a function is dy/dx, which can be approximated by Δy/Δx, that is, "change in y over change in x". This can be written in R as
ycs.prime <- diff(ycs)/diff(x)
and now ycs.prime contains an approximation to the derivative of the function at each x: however it is a vector of length 999, so you will need to shorten x (i.e. use x[1:999] or x[2:1000]) when doing any analysis or plotting.
There is also gradient from the pracma package.
grad <- pracma::gradient(ycs, h1 = x)
plot(grad, col = 1)

Resources