How to estimate the lambdas of Poisson-distributed samples in R, and how to draw a kernel density estimate of the estimator based on them?

So I have 500 Poisson-distributed simulated samples with n = 100 each.
1) How can I estimate the lambda for each of these samples separately in R?
2) How can I draw a kernel estimate of the density function of the estimator for lambda, based on the 500 estimated lambdas? (My guess is that this involves the "KernSmooth" package and the function "bkfe", but I have failed to program it properly.)
taskpois <- function(size, leng){
  taskmlepois <- numeric(leng)
  for (i in 1:leng){
    randompois <- rpois(size, 6)
    taskmlepois[i] <- mean(randompois)  # ML estimate of lambda = sample mean
  }
  return(taskmlepois)
}
tasksample <- taskpois(size=100, leng=500)

As the comments suggest, it seems you're pretty close already.
ltarget <- 2                                  # true lambda
set.seed(101)
## 500 ML estimates of lambda (the sample mean), each from n = 100
lambdavec <- replicate(500, mean(rpois(100, lambda=ltarget)))
dd <- density(lambdavec)                      # kernel density estimate
plot(dd, main="", las=1, bty="l")
We might as well add the expected result based on asymptotic theory:
curve(dnorm(x,mean=2,sd=sqrt(2/100)),add=TRUE,col=2)
We can add another line that shows that the variation among the densities of different experiments is pretty large relative to the difference between the theoretical and observed density from the first experiment:
lambdavec2 <- replicate(500,mean(rpois(100,lambda=ltarget)))
lines(density(lambdavec2),col=4)
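If you specifically want the KernSmooth package mentioned in the question: bkde() (binned kernel density estimate) is the function that estimates the density itself, while bkfe() estimates density functionals rather than the density. A minimal sketch (the plug-in bandwidth from dpik() is one reasonable choice, not the only one):
library(KernSmooth)
h <- dpik(lambdavec)                  # plug-in bandwidth selector
est <- bkde(lambdavec, bandwidth=h)   # returns a list with components x and y
lines(est, col=3)                     # compare with the density() estimate above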

Related

Realize 1000 samples of non-i.i.d. normally distributed variables in R

I want to generate a realization X_1, ..., X_1000 in R where it is known that
X_t = 0.9*X_{t-1} + e_t,
where the errors e_t are IID standard normal, and X_1 is N(0, 100/19).
I have shown that each X_i has the distribution N(0, 100/19) and that the X_i are not independent.
How does one generate this sample realization in R?
The question states that the errors are standard normal, so their sd is 1. Note also that rnorm() takes the standard deviation, not the variance, so X_1 ~ N(0, 100/19) needs sd = sqrt(100/19). With that, just generate the data sequentially using the recursion you gave.
set.seed(2019)
x <- numeric(1000)
x[1] <- rnorm(1, 0, sqrt(100/19))   # rnorm takes sd, not variance
for(i in 2:1000) {
  x[i] <- 0.9*x[i-1] + rnorm(1)     # standard normal errors (sd = 1)
}
plot(x, type='l')
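As a quick sanity check (a sketch, assuming the corrected code above): with standard normal errors and coefficient 0.9, the stationary variance is 1/(1 - 0.9^2) = 100/19 ≈ 5.26, and the empirical variance of a long realization should approach it.
set.seed(2020)
n <- 1e5
xl <- numeric(n)
xl[1] <- rnorm(1, 0, sqrt(100/19))
for(i in 2:n) xl[i] <- 0.9*xl[i-1] + rnorm(1)
c(empirical = var(xl), theoretical = 100/19)   # both around 5.26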

Fit distribution to given frequency values in R

I have frequency values changing with time (x-axis units), as in the plot produced below. After some normalization these values may be seen as data points of a density function for some distribution.
Q: Assuming that these frequency points come from a Weibull distribution T, how can I best fit a Weibull density function to the points so as to infer the parameters of T from it?
sample <- c(7787,3056,2359,1759,1819,1189,1077,1080,985,622,648,518,
611,1037,727,489,432,371,1125,69,595,624)
plot(1:length(sample), sample, type = "l")
points(1:length(sample), sample)
Update.
To prevent being misunderstood, I would like to add a little more explanation. By saying I have frequency values changing with time (x-axis units), I mean my data says that I have:
7787 realizations of value 1
3056 realizations of value 2
2359 realizations of value 3 ... etc.
One way towards my goal (an incorrect one, I think) would be to expand these counts into a set of individual realizations:
# Loop to expand counts into realizations
set.values <- c()
for(i in 1:length(sample)){
  set.values <- c(set.values, rep(i, times = sample[i]))
}
hist(set.values)
lines(1:length(sample), sample)
points(1:length(sample), sample)
and use fitdistr (from the MASS package) on set.values:
library(MASS)
f2 <- fitdistr(set.values, 'weibull')
f2
Why do I think this is the wrong way, and why am I looking for a better solution in R?
The distribution-fitting approach above assumes that set.values is a complete set of realizations from the distribution T.
In my original question I only know the points from the first part of the density curve; I do not know its tail, and I want to estimate the tail (and hence the whole density function).
Here is a better attempt. Like before, it uses optim to find the best value constrained to a box (defined by the lower and upper vectors in the optim call). Notice that it scales x and y as part of the optimization, in addition to the Weibull shape parameter, so we have 3 parameters to optimize over.
Unfortunately, when using all the points it pretty much always finds something on the edge of the constraining box, which suggests to me that a Weibull may not be a good fit for all of the data. The problem is the first two points: they are just too large. You can see the attempted fit to all the data in the first plot.
If I drop those first two points and just fit the rest, we get a much better fit. You can see this in the second plot. I think this is a good fit; it is in any case a local minimum in the interior of the constraining box.
# base stats::optim is used below; box constraints require method = "L-BFGS-B"
sample <- c(60953,7787,3056,2359,1759,1819,1189,1077,1080,985,622,648,518,
611,1037,727,489,432,371,1125,69,595,624)
t.sample <- 0:22
s.fit <- sample[3:23]
t.fit <- t.sample[3:23]
wx <- function(param) {
  ## param[1] = Weibull shape, param[2] = y scale, param[3] = x scale
  res <- param[2]*dweibull(t.fit*param[3], shape=param[1])
  return(res)
}
minwx <- function(param){
  ## square root of the sum of squared residuals between counts and scaled density
  v <- s.fit - wx(param)
  sqrt(sum(v*v))
}
p0 <- c(1,200,1/20)
paramopt <- optim(p0, minwx, gr=NULL, method="L-BFGS-B", lower=c(0.1,100,0.01), upper=c(1.1,5000,1))
popt <- paramopt$par
popt
rms <- paramopt$value
tit <- sprintf("Weibull - Shape:%.3f yscale:%.1f xscale:%.5f rms:%.1f", popt[1], popt[2], popt[3], rms)  # popt[2] scales y, popt[3] scales x
plot(t.sample[2:23], sample[2:23], type = "p",col="darkred")
lines(t.fit, wx(popt),col="blue")
title(main=tit)
You can directly calculate the maximum likelihood parameters, as described here.
# Defining the error of the implicit ML equation for the shape parameter k
k.diff <- function(k, vec){
  x2 <- seq(length(vec))
  abs(k^-1 + weighted.mean(log(x2), w = vec) -
        weighted.mean(log(x2), w = x2^k*vec))
}
# Driving the error to (nearly) zero solves the equation for k
k <- optimize(k.diff, vec=sample, interval=c(0.1,5), tol=10^-7)$min
# Calculate the scale lambda, given k; the weighted mean estimates
# lambda^k, so take the k-th root
l <- weighted.mean(seq(length(sample))^k, w = sample)^(1/k)
# Plot
plot(density(rep(seq(length(sample)), sample)))
x <- 1:25
lines(x, dweibull(x, shape=k, scale=l))
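As a quick cross-check (a sketch using MASS::fitdistr on the expanded counts; this treats the observed range as a complete sample, so expect similar but not identical estimates):
library(MASS)
expanded <- rep(seq_along(sample), times = sample)  # one value per count
fitdistr(expanded, 'weibull')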
Assuming the data are from a Weibull distribution, you can get estimates of the shape and scale parameters like this:
library(MASS)  # for fitdistr
sample <- c(7787,3056,2359,1759,1819,1189,1077,1080,985,622,648,518,
            611,1037,727,489,432,371,1125,69,595,624)
f <- fitdistr(sample, 'weibull')
f
If you are not sure whether the data follow a Weibull distribution, I would recommend using ks.test. It tests whether your data come from a hypothesized distribution. Given your knowledge of the nature of the data, you could test a few candidate distributions and see which one works best.
For your example this would look like this:
ks = ks.test(sample, "pweibull", shape=f$estimate[1], scale=f$estimate[2])
ks
The p-value is not significant, hence you do not reject the hypothesis that the data come from a Weibull distribution. (Note that because the parameters were estimated from the same data, the KS p-value is only approximate.)
Update: Histograms of draws from either the fitted Weibull or the fitted exponential look like a good match to your data. I think the exponential distribution gives you the better fit. The Pareto distribution is another option.
f<-fitdistr(sample, 'weibull')
z<-rweibull(10000, shape= f$estimate[1],scale= f$estimate[2])
hist(z)
f<-fitdistr(sample, 'exponential')
z = rexp(10000, f$estimate[1])
hist(z)
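Instead of comparing histograms of simulated draws, you can overlay the fitted density directly on a density-scaled histogram of the data (a sketch, again assuming MASS is loaded):
f_exp <- fitdistr(sample, 'exponential')
hist(sample, freq=FALSE, breaks=20)                      # density scale
curve(dexp(x, rate=f_exp$estimate[1]), add=TRUE, col=2)  # fitted exponential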

Errors running Maximum Likelihood Estimation on a three-parameter Weibull cdf

I am working with the cumulative emergence of flies over time (taken at irregular intervals) over many summers (though first I am just trying to make one year work). The cumulative emergence follows a sigmoid pattern and I want to create a maximum likelihood estimation of a 3-parameter Weibull cumulative distribution function. The three-parameter models I've been trying to use in the fitdistrplus package keep giving me an error. I think this must have something to do with how my data is structured, but I cannot figure it out. Obviously I want it to read each point as an x (degree days) and a y (emergence) value, but it seems to be unable to read two columns. The main error I'm getting says "Non-numeric argument to mathematical function" or (with slightly different code) "data must be a numeric vector of length greater than 1". Below is my code including added columns in the df_dd_em dataframe for cumulative emergence and percent emergence in case that is useful.
degree_days <- c(998.08,1039.66,1111.29,1165.89,1236.53,1293.71,
1347.66,1387.76,1445.47,1493.44,1553.23,1601.97,
1670.28,1737.29,1791.94,1849.20,1920.91,1967.25,
2036.64,2091.85,2152.89,2199.13,2199.13,2263.09,
2297.94,2352.39,2384.03,2442.44,2541.28,2663.90,
2707.36,2773.82,2816.39,2863.94)
emergence <- c(0,0,0,1,1,0,2,3,17,10,0,0,0,2,0,3,0,0,1,5,0,0,0,0,
0,0,0,0,1,0,0,0,0,0)
cum_em <- cumsum(emergence)
df_dd_em <- data.frame (degree_days, emergence, cum_em)
df_dd_em$percent <- ave(df_dd_em$emergence, FUN = function(df_dd_em) 100*(df_dd_em)/46)
df_dd_em$cum_per <- ave(df_dd_em$cum_em, FUN = function(df_dd_em) 100*(df_dd_em)/46)
x <- pweibull(df_dd_em[c(1,3)],shape=5)
dframe2.mle <- fitdist(x, "weibull",method='mle')
Here's my best guess at what you're after:
Set up data:
dd <- data.frame(degree_days=c(998.08,1039.66,1111.29,1165.89,1236.53,1293.71,
1347.66,1387.76,1445.47,1493.44,1553.23,1601.97,
1670.28,1737.29,1791.94,1849.20,1920.91,1967.25,
2036.64,2091.85,2152.89,2199.13,2199.13,2263.09,
2297.94,2352.39,2384.03,2442.44,2541.28,2663.90,
2707.36,2773.82,2816.39,2863.94),
emergence=c(0,0,0,1,1,0,2,3,17,10,0,0,0,2,0,3,0,0,1,5,0,0,0,0,
0,0,0,0,1,0,0,0,0,0))
dd <- transform(dd,cum_em=cumsum(emergence))
We're actually going to fit to an "interval-censored" distribution, i.e. the probability of emergence between successive degree-day observations. This version assumes the first interval covers emergences before the first degree-day observation; you could change it to cover emergences after the last observation instead.
library(bbmle)
## y*log(p) allowing for 0/0 occurrences:
y_log_p <- function(y,p) ifelse(y==0 & p==0,0,y*log(p))
NLLfun <- function(scale,shape,x=dd$degree_days,y=dd$emergence) {
  ## interval probabilities, floored at 1e-6 to avoid log(0)
  prob <- pmax(diff(pweibull(c(-Inf,x), ## or c(x,Inf)
                             shape=shape,scale=scale)),1e-6)
  ## multinomial (negative log-) likelihood
  -sum(y_log_p(y,prob))
}
I should probably have used something more systematic like the method of moments (i.e. matching the mean and variance of a Weibull distribution with the mean and variance of the data), but I just hacked around a bit to find plausible starting values:
## preliminary look (method of moments would be better)
scvec <- 10^(seq(0,4,length=101))
plot(scvec,sapply(scvec,NLLfun,shape=1))
It's important to use parscale to let R know that the parameters are on very different scales:
startvals <- list(scale=1000,shape=1)
m1 <- mle2(NLLfun,start=startvals,
control=list(parscale=unlist(startvals)))
Now try with a three-parameter Weibull (as originally requested) -- requires only a slight modification of what we already have:
library(FAdist)
NLLfun2 <- function(scale,shape,thres,
                    x=dd$degree_days,y=dd$emergence) {
  prob <- pmax(diff(pweibull3(c(-Inf,x),shape=shape,scale=scale,thres=thres)),
               1e-6)
  ## multinomial probability
  -sum(y_log_p(y,prob))
}
startvals2 <- list(scale=1000,shape=1,thres=100)
m2 <- mle2(NLLfun2,start=startvals2,
control=list(parscale=unlist(startvals2)))
Looks like the three-parameter fit is much better:
## AICtab comes from bbmle (already loaded)
AICtab(m1,m2)
## dAIC df
## m2 0.0 3
## m1 21.7 2
And here's the graphical summary:
with(dd,plot(cum_em~degree_days,cex=3))
with(as.list(coef(m1)),curve(sum(dd$emergence)*
pweibull(x,shape=shape,scale=scale),col=2,
add=TRUE))
with(as.list(coef(m2)),curve(sum(dd$emergence)*
pweibull3(x,shape=shape,
scale=scale,thres=thres),col=4,
add=TRUE))
(could also do this more elegantly with ggplot2 ...)
These don't seem like spectacularly good fits, but they're sane. (You could in principle do a chi-squared goodness-of-fit test based on the expected number of emergences per interval, and accounting for the fact that you've fitted a three-parameter model, although the values might be a bit low ...)
Confidence intervals on the fit are a bit of a nuisance; your choices are (1) bootstrapping; (2) parametric bootstrapping (resample parameters assuming a multivariate normal distribution of the parameter estimates); (3) the delta method.
Using bbmle::mle2 makes it easy to do things like get profile confidence intervals:
confint(m1)
## 2.5 % 97.5 %
## scale 1576.685652 1777.437283
## shape 4.223867 6.318481
For completeness, here is the full script for the three-parameter fit, this time fitting to percent emergence:
dd <- data.frame(degree_days=c(998.08,1039.66,1111.29,1165.89,1236.53,1293.71,
1347.66,1387.76,1445.47,1493.44,1553.23,1601.97,
1670.28,1737.29,1791.94,1849.20,1920.91,1967.25,
2036.64,2091.85,2152.89,2199.13,2199.13,2263.09,
2297.94,2352.39,2384.03,2442.44,2541.28,2663.90,
2707.36,2773.82,2816.39,2863.94),
emergence=c(0,0,0,1,1,0,2,3,17,10,0,0,0,2,0,3,0,0,1,5,0,0,0,0,
0,0,0,0,1,0,0,0,0,0))
dd$cum_em <- cumsum(dd$emergence)
dd$percent <- ave(dd$emergence, FUN = function(dd) 100*(dd)/46)
dd$cum_per <- ave(dd$cum_em, FUN = function(dd) 100*(dd)/46)
# start the three-parameter model
library(FAdist)   # pweibull3
library(bbmle)    # mle2
## y*log(p) allowing for 0/0 occurrences:
y_log_p <- function(y,p) ifelse(y==0 & p==0,0,y*log(p))
NLLfun2 <- function(scale,shape,thres,
                    x=dd$degree_days,y=dd$percent) {
  prob <- pmax(diff(pweibull3(c(-Inf,x),shape=shape,scale=scale,thres=thres)),
               1e-6)
  ## multinomial probability
  -sum(y_log_p(y,prob))
}
startvals2 <- list(scale=1000,shape=1,thres=100)
m2 <- mle2(NLLfun2,start=startvals2,
control=list(parscale=unlist(startvals2)))
summary(m2)
#graphical summary
windows(5,5)  # Windows-only; use dev.new(width=5, height=5) on other platforms
with(dd,plot(cum_per~degree_days,cex=3))
with(as.list(coef(m2)),curve(sum(dd$percent)*
pweibull3(x,shape=shape,
scale=scale,thres=thres),col=4,
add=TRUE))

Finding distribution of sample mean by central limit theorem

Let X1, ..., X25 be a random sample from a normal distribution with mean = 37 and sd = 45.
Let xbar be the sample mean. How is xbar distributed? I have to verify it by the central limit theorem.
Also compute P(xbar > 43.1).
my attempt
for(i in 1:1000){
x=rnorm(25,mean=37,sd=45)
xbar=mean(x)
z=(xbar-37)/(45/sqrt(25))
}
z
But I couldn't find the distribution of xbar.
Either replace your for loop with replicate:
set.seed(1)
X <- replicate(1000, rnorm(25, mean=37, sd=45))
X_bar <- colMeans(X)
hist(X_bar) # this is how the distribution of X_bar looks
or fix the loop so that it saves every value of xbar:
xbar <- c()
for(i in 1:1000){
  x <- rnorm(25, mean=37, sd=45)
  xbar <- c(xbar, mean(x)) # save the value of xbar each time
}
hist(xbar) # plot the histogram of xbar
# estimate the probability of being bigger than 43.1
# (note: which() returns indices, not a count; use mean() for a proportion)
prob <- mean(xbar > 43.1)
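By the CLT, xbar is approximately N(37, 45/sqrt(25)) = N(37, 9), so the probability can also be computed exactly as a check on the simulation estimate:
1 - pnorm(43.1, mean=37, sd=45/sqrt(25))   # about 0.249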
Just to expand on this a little bit.
The Central Limit Theorem states that the distribution of the mean is asymptotically N[mu, sd/sqrt(n)], where mu and sd are the mean and standard deviation of the underlying distribution and n is the sample size used in calculating the mean. So, in the example below, data is a dataset of size 2500 drawn from N[37,45], arbitrarily segmented into 100 groups of 25, and means is a dataset of the means of each group. Note that both the data and the means are (approximately) normally distributed, but the distribution of the means is much tighter (lower sigma). From the CLT we expect sd(means) ~ sd(data)/sqrt(25), which it is.
data <- data.frame(sample=rep(1:100,each=25),x = rnorm(2500,mean=37,sd=45))
means <- aggregate(data$x,by=list(data$sample),mean)
# plot histograms
par(mfrow=c(1,2))
hist(data$x,main="",sub="Histogram of Underlying Data",xlim=c(-150,200))
hist(means$x,main="",sub="Histogram of Means", xlim=c(-150,200))
mtext("Underlying Data ~ N[37,45]",outer=T,line=-3)
c(sd.data=sd(data$x), sd.means=sd(means$x))
sd.data sd.means
43.548570 7.184518
But the real power of the CLT is that it shows that the distribution of the means is asymptotically normal regardless of the distribution of the underlying data. This is shown here, where the underlying data are sampled from a uniform distribution. Again, sd(means) ~ sd(data)/sqrt(25).
data <- data.frame(sample=rep(1:100,each=25),x = runif(2500,min=-150, max=200))
means <- aggregate(data$x,by=list(data$sample),mean)
# plot histograms
par(mfrow=c(1,2))
hist(data$x,main="",sub="Histogram of Underlying Data",xlim=c(-150,200))
hist(means$x,main="",sub="Histogram of Means", xlim=c(-150,200))
mtext("Underlying Data ~ U[-150,200]",outer=T,line=-3)
c(sd.data=sd(data$x), sd.means=sd(means$x))
sd.data sd.means
99.7800 18.8176

Confidence interval for Weibull distribution

I have wind data that I'm using to perform extreme value analysis (calculate return levels). I'm using R with packages 'evd', 'extRemes' and 'ismev'.
I'm fitting GEV, Gumbel and Weibull distributions, in order to estimate the return levels (RL) for some period T.
For the GEV and Gumbel cases, I can get RL's and Confidence Intervals using the extRemes::return.level() function.
Some code:
require(ismev)
require(MASS)
data(wind)
x = wind[, 2]
rperiod = 10
fit <- fitdistr(x, 'weibull')
s <- fit$estimate['shape']
b <- fit$estimate['scale']
rlevel <- qweibull(1 - 1/rperiod, shape = s, scale = b)
## CI around rlevel
## ci.rlevel = ??
But for the Weibull case, I need some help to generate the CI's.
I suspect the excruciatingly correct answer will be that the joint confidence region is an ellipse or some bent-sausage shape. But you can extract variance estimates for the parameters from the fit object with the vcov function and then build standard errors, for which +/- 1.96 SE should be informative:
> sqrt(vcov(fit)["shape", "shape"])
[1] 0.691422
> sqrt(vcov(fit)["scale", "scale"])
[1] 1.371256
> s +c(-1,1)*sqrt(vcov(fit)["shape", "shape"])
[1] 6.162104 7.544948
> b +c(-1,1)*sqrt(vcov(fit)["scale", "scale"])
[1] 54.46597 57.20848
The usual way to calculate a CI for a single parameter is to assume normality and use theta +/- 1.96*SE(theta). In this case you have two parameters, so doing that with both of them would give you a "box", the 2D analog of an interval. The truly correct answer would be something more complex in the 'scale'-by-'shape' parameter space, and it might be most easily achieved with simulation methods, unless you have a better grasp of the theory than I have.
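Since the quantity of interest is the return level rather than the parameters themselves, a parametric-bootstrap CI is straightforward: simulate from the fitted Weibull, refit, recompute the return level, and take quantiles. A minimal sketch (B = 1000 and the 95% level are arbitrary choices; it assumes x, s, b and rperiod from the code above, with MASS loaded):
set.seed(123)
B <- 1000
rlevel.boot <- replicate(B, {
  xb <- rweibull(length(x), shape = s, scale = b)   # simulate from the fit
  fb <- fitdistr(xb, 'weibull')                     # refit
  qweibull(1 - 1/rperiod,
           shape = fb$estimate['shape'],
           scale = fb$estimate['scale'])             # return level of refit
})
quantile(rlevel.boot, c(0.025, 0.975))              # percentile 95% CI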
