Calculating the MSE by definition vs. via Var + Bias^2 in R

So I am trying to calculate the MSE in two ways. Say T is an estimator for the value t.
First, I calculate it in R using the theorem:
MSE(T) = Var(T) + (Bias(T))^2
Second, I calculate it in R by definition, i.e. MSE(T) = E((T - t)^2).
Now say that T is an unbiased estimator, i.e. Bias(T) = 0.
Then MSE(T) = Var(T), which in R is just var(T).
But when I calculate the MSE by definition, I get a different number from var(T),
so I think the formula I wrote in R is wrong. It was suggested that "weighted.mean"
is equivalent to the "expected value" function, so for the MSE by definition I wrote:
weighted.mean((T - 2)^2), where my t = 2.
I hope I provided enough information to get help; thanks in advance.
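For reference, a minimal sketch of the two computations side by side (here T is a vector of simulated estimates of a true value t = 2; the simulation is purely an assumption for illustration). Note that var() centers at the sample mean and divides by n - 1, whereas the definition-based average centers at the true t and divides by n, so the two numbers will differ slightly even for an unbiased estimator:
set.seed(1)
t_true <- 2
T <- rnorm(1000, mean = t_true, sd = 1)  # hypothetical unbiased estimates of t
var(T)                         # Var(T): centered at mean(T), divided by n - 1
mean((T - t_true)^2)           # MSE by definition: centered at t, divided by n
weighted.mean((T - t_true)^2)  # identical to mean() when no weights are supplied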

Related

Julia function for weighted variance returning "wrong" value

I'm trying to calculate the weighted variance using Julia, but when I compare the results
with my own formula, I get a different value.
using Statistics, StatsBase  # Weights and the weighted method of var come from StatsBase
x = rand(10)
w = Weights(rand(10))
Statistics.var(x, w, corrected=false)    # Julia's built-in weighted variance
sum(w .* (x .- mean(x)).^2) / sum(w)     # my own formula
When I read the docs for the "var" function, it says that the formula for "corrected=false" is
the one I wrote.
You have to subtract the weighted mean in your formula to get the same result:
sum(w .* (x .- mean(x, w)).^2) / sum(w)
or, expanding the weighted mean:
sum(w .* (x .- sum(w .* x) / sum(w)).^2) / sum(w)

Defining exponential distribution in R to estimate probabilities

I have a bunch of random variables (X1, ..., Xn) which are i.i.d. Exp(1/2) and represent the duration of a certain event. This distribution obviously has an expected value of 2, but I am having trouble defining it in R. I did some research and found something about a so-called Monte Carlo simulation, but I can't seem to find what I am looking for in it.
An example of what I want to estimate: say we have 10 random variables (X1, ..., X10) distributed as above, and we want to determine the probability P(X1 + ... + X10 <= 25).
Thanks.
You don't actually need Monte Carlo simulation in this case, because if Xi ~ Exp(λ), then the sum X1 + ... + Xk ~ Erlang(k, λ), which is just a Gamma(k, 1/λ) (in the (k, θ) parametrization) or a Gamma(k, λ) (in the (α, β) parametrization) with an integer shape parameter k.
See Wikipedia (https://en.wikipedia.org/wiki/Exponential_distribution#Related_distributions).
So P(X1 + ... + X10 <= 25) can be computed by
pgamma(25, shape=10, rate=0.5)
Are you aware of the rexp() function in R? Have a look at its documentation page by typing ?rexp in the R console.
A quick answer for the Monte Carlo estimation of the desired probability:
mean(rowSums(matrix(rexp(1000 * 10, rate = 0.5), 1000, 10)) <= 25)
I have generated 1000 sets of 10 exponential samples, putting them into a 1000 * 10 matrix. We take row sums to get a vector of 1000 entries; the proportion of entries less than or equal to 25 is an empirical estimate of the desired probability.
Thanks, this was helpful! Can I use replicate with this code, to make it look like this: F <- function(n, B=1000) mean(replicate(B, (rexp(10, rate = 0.5))))? I can't get it to output the right result.
replicate here generates a matrix, too, but it is a 10 * 1000 matrix (as opposed to the 1000 * 10 one in my answer), so you now need to take colSums. Also, where did you put n?
The correct function would be
F <- function(n, B=1000) mean(colSums(replicate(B, rexp(10, rate = 0.5))) <= n)
For a non-Monte Carlo approach to your example, see the other answer: the exponential distribution is a special case of the gamma distribution, and the latter has an additivity property. I am giving you the Monte Carlo method because you mentioned it in your question, and it is applicable beyond this particular example.
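Putting the two answers together, here is a quick sanity check that the exact gamma probability and the Monte Carlo estimate agree (the replicate count B = 10000 and the seed are arbitrary choices for illustration):
exact <- pgamma(25, shape = 10, rate = 0.5)  # exact, via the Erlang/gamma identity
set.seed(42)
B <- 10000
mc <- mean(rowSums(matrix(rexp(B * 10, rate = 0.5), B, 10)) <= 25)
c(exact = exact, monte_carlo = mc)           # should agree to about two decimals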

Standard error in glm output

I am using R's glm to model Poisson data binned by year, so I have x[i] counts with T[i] exposure in each year i. The glm with poisson family and log link produces model coefficients a, b for y = a + bx.
What I need is the standard error of (a + bx), not the standard error of a or the standard error of b. The documentation describing the solution I am trying to implement says this should be calculated by the software, because it is not straightforward to calculate from the parameters a and b alone. Perhaps SAS does the calculation, but I am not finding it in R.
I am working through section 7.2.4.5 of the Handbook of Parameter Estimation (NUREG/CR-6823, a public document), looking at eq. 7.2. I am not a statistician, so I am finding this very hard to follow.
The game here is to find the 90 percent simultaneous confidence interval on the model output, not the confidence interval at each year, i.
Edit: adding this here so I can show some code. The first answer below got me pretty close. A statistician here put together the following function to construct the confidence bounds, and it appears to work.
# trend line simultaneous confidence intervals per HOPE (NUREG/CR-6823) sec. 7.2.4.5
# model: the fitted glm; data: a data frame with columns x (year) and T (exposure);
# n: the number of years
HOPE <- function(model, data, n) {
  t <- data$T
  mle <- predict(model, newdata = data.frame(x = data$x), type = "response")
  se <- predict(model, newdata = data.frame(x = data$x),
                type = "link", se.fit = TRUE)$se.fit
  chi <- qchisq(0.90, df = n - 1)
  upper <- (mle + chi * se) / t
  lower <- (mle - chi * se) / t
  data.frame(mle, t, upper, lower)
}
I think you need to provide the argument se.fit=TRUE when you create the prediction from the model:
hotmod<-glm(...)
predz<-predict(hotmod, ..., se.fit=TRUE)
Then you should be able to find the estimated standard errors using:
predz$se.fit
Now if you want to do it by hand in this software, it is not as hard as you might think. Get the coefficient covariance matrix:
covmat <- vcov(hotmod)
Then the standard error of the linear predictor a + b*x at a given x uses the design vector (1, x), not the coefficient vector itself:
x0 <- c(1, x)
sqrt(t(x0) %*% covmat %*% x0)
The operator %*% can be used for matrix multiplication in this software.
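To make this concrete, here is a minimal sketch with hypothetical yearly counts and exposures (the data frame, variable names, and coefficient values are invented for illustration); it shows predict()'s se.fit on the link scale and the matching hand calculation:
set.seed(1)
dat <- data.frame(year = 1:20, exposure = runif(20, 50, 100))
dat$count <- rpois(20, lambda = exp(-1 + 0.05 * dat$year) * dat$exposure)
fit <- glm(count ~ year + offset(log(exposure)), family = poisson, data = dat)
pr <- predict(fit, type = "link", se.fit = TRUE)
pr$se.fit                         # standard error of a + b*year for each year
x0 <- c(1, dat$year[1])           # design vector for year 1
sqrt(t(x0) %*% vcov(fit) %*% x0)  # matches pr$se.fit[1]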

Get degrees of freedom for a Standardized T Distribution with MLE

First of all, thank you all in advance for reading this.
I am trying to fit a standardized Student's t distribution (i.e. a Student's t with standard deviation = 1) to a series of data; that is, I want to estimate the degrees of freedom via maximum likelihood estimation.
An example of what I need to achieve can be found in the following (simple) Excel file I made:
https://www.dropbox.com/s/6wv6egzurxh4zap/Excel%20Implementation%20Example.xlsx?dl=0
Inside the Excel file, I have an image that contains the formula corresponding to the calculation of the loglikelihood function for the Standardized T Student Distribution. The formula was extracted from a Finance book (Elements of Financial Risk Management - by Peter Christoffersen).
So far, I have tried this with R:
copula.data <- read.csv(file.choose(), header = TRUE)
z1 <- copula.data[, 1]
library(fitdistrplus)
# start must be a named list of initial parameter values
ft1 <- fitdist(z1, "t", method = "mle", start = list(df = 10))
df1 <- ft1$estimate[1]
df1
logLik(ft1)
df1 yields the number: 13.11855278779897
logLik(ft1) yields the number: -3600.2918050056487
However, the Excel file yields degrees of freedom of 8.2962365022727 and a log-likelihood of -3588.8879 (which is the right answer).
Note: the .csv file that my code reads is the following:
https://www.dropbox.com/s/nnh2jgq4fl6cm12/Data%20for%20T%20Copula.csv?dl=0
Any ideas? Thank you people!
The formula from your spreadsheet (with n, x substituted for the df parameter and the data)
=GAMMALN((n+1)/2)-GAMMALN(n/2)-LN(PI())/2-LN(n-2)/2-1/2*(1+n)*LN(1+x^2/(n-2))
or, exponentiating,
Gamma((n+1)/2) / (sqrt((n-2) pi) Gamma(n/2)) (1+x^2/(n-2))^-((n+1)/2)
?dt gives
f(x) = Gamma((n+1)/2) / (sqrt(n pi) Gamma(n/2)) (1 + x^2/n)^-((n+1)/2)
So the difference lies in those n-2 values in two places in the formula. The n-2 comes from the standardization: a t distribution with n degrees of freedom has variance n/(n-2), so rescaling it by sqrt((n-2)/n) yields variance exactly 1, which is precisely the "standardized" t the question asks about.
Looking at the negative log-likelihood curve directly, it certainly seems as though the fitdistrplus answer agrees with the direct calculation. (It would be very surprising if there were a bug in the dt() function; R's distribution functions are very broadly used and thoroughly tested.)
# negative log-likelihood of a plain t distribution with df = p
LL <- function(p, data = z1) {
  -sum(dt(data, df = p, log = TRUE))
}
pvec <- seq(6, 20, by = 0.05)
Lvec <- sapply(pvec, LL)
par(las = 1, bty = "l")
plot(pvec, Lvec, type = "l",
     xlab = "df parameter", ylab = "negative log-likelihood")
## superimpose fitdistr results ...
abline(v = coef(ft1), lty = 2)
abline(h = -logLik(ft1), lty = 2)
Unless there's something else you're not telling us about the problem definition, it seems to me that R is getting the right answer. (The mean and sd of the data you gave were not exactly equal to 0 and 1 respectively, but they were close; centering and scaling gave an even larger value for the parameter.)
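If the goal is to reproduce the spreadsheet's answer rather than dt()'s, here is a sketch of maximizing the book's standardized-t log-likelihood directly (it assumes the data vector z1 from the question; the search interval is an arbitrary choice):
# negative log-likelihood of the standardized t, i.e. the formula with n - 2 above
nll_std_t <- function(n, x) {
  -sum(lgamma((n + 1) / 2) - lgamma(n / 2) - log(pi) / 2 - log(n - 2) / 2 -
         (n + 1) / 2 * log(1 + x^2 / (n - 2)))
}
opt <- optimize(nll_std_t, interval = c(2.1, 50), x = z1)
opt$minimum     # MLE of the degrees of freedom; should be near Excel's 8.296
-opt$objective  # maximized log-likelihood; should be near -3588.89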

Log Likelihood using R

I have a probability density function (PDF)
(1-cos(x-theta))/(2*pi)
where theta is the unknown parameter. How do I write a log-likelihood function for this PDF? I am confused: the x will come from my data, but how do I handle the theta in the equation?
Thanks
You need to use an optimisation or maximisation function in R to compute the value of theta that maximises the log-likelihood. See help(nlm) for starters.
The function you wrote is the log-likelihood of theta given a known x:
ll(theta | x) = log((1 - cos(x - theta)) / (2*pi))
If you have many i.i.d. observations x1, x2, ..., xn from this distribution, just take the sum of the above:
ll(theta | x1, x2, ..., xn) = Sum[log((1 - cos(xi - theta)) / (2*pi))]
If f(x_i) = (1 - cos(x_i - theta)) / (2*pi) for observation i, then the likelihood function is L(theta) = product(f(x_i)) and logL(theta) = sum(log(f(x_i))), of course assuming that the x_i are independent.
I think log-likelihood only works for normal distributions; the special property of the log function is that it cancels the exp function, but there is no exp function here.
By the way, your PDF is periodic, and theta just shifts the phase of the function. Where does this PDF come from? What is it supposed to describe?
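For concreteness, here is a minimal sketch of maximizing this log-likelihood numerically over theta. The data are simulated by rejection sampling with a hypothetical true theta = 1; the sample size and seed are arbitrary choices:
set.seed(1)
theta_true <- 1
n <- 5000
# rejection sampling from f(x) = (1 - cos(x - theta)) / (2*pi) on [0, 2*pi):
# accept a Unif(0, 2*pi) draw with probability (1 - cos(x - theta)) / 2
x <- numeric(0)
while (length(x) < n) {
  cand <- runif(n, 0, 2 * pi)
  keep <- runif(n) < (1 - cos(cand - theta_true)) / 2
  x <- c(x, cand[keep])
}
x <- x[1:n]
# negative log-likelihood: minus the sum from the answers above
nll <- function(theta) -sum(log((1 - cos(x - theta)) / (2 * pi)))
optimize(nll, interval = c(0, 2 * pi))$minimum  # should be close to theta_true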
