Standard error in glm output - R

I am using R's glm to model Poisson data binned by year, so I have x[i] counts with exposure T[i] in each year i. Fitting a glm with the Poisson family and log link produces model coefficients a, b for the linear predictor y = a + bx.
What I need is the standard error of (a + bx), not the standard error of a or of b alone. The documentation describing the method I am trying to implement says this should be calculated by the software, because it is not straightforward to compute from the standard errors of a and b. Perhaps SAS does this calculation, but I am not recognizing it in R.
I am working through section 7.2.4.5 of the Handbook of Parameter Estimation (NUREG/CR-6823, a public document) and looking at eq. 7.2. I am not a statistician, so I am finding this very hard to follow.
The game here is to find the 90 percent simultaneous confidence interval on the model output, not the confidence interval at each year, i.
Adding this here so I can show some code. The first answer below appears to get me pretty close. A statistician here put together the following function to construct the confidence bounds, and it appears to work.
# trend line simultaneous confidence intervals
# according to HOPE 7.2.4.5
HOPE <- function(model, data) {
  t   <- data$T                      # exposure in each year
  n   <- nrow(data)                  # number of yearly bins
  # fitted values on the response scale (expected counts)
  mle <- predict(model, newdata = data, type = "response")
  # standard error of the linear predictor (link scale)
  se  <- predict(model, newdata = data, type = "link", se.fit = TRUE)$se.fit
  # chi-square quantile used as the simultaneous-band multiplier (HOPE 7.2.4.5)
  chi <- qchisq(0.90, df = n - 1)
  upper <- (mle + chi * se) / t
  lower <- (mle - chi * se) / t
  data.frame(mle = mle, t = t, upper = upper, lower = lower)
}
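For example, a hypothetical call (the data frame layout, column names, and numbers below are assumptions for illustration, not from the original analysis), with the exposures entering the fit as an offset:
data <- data.frame(x = 1:10,
                   T = c(120, 130, 115, 140, 125, 150, 135, 145, 160, 155),
                   counts = c(3, 5, 2, 6, 4, 7, 5, 6, 9, 8))
model <- glm(counts ~ x + offset(log(T)), family = poisson(link = "log"), data = data)
HOPE(model, data)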

I think you need to provide the argument se.fit=TRUE when you create the prediction from the model:
hotmod<-glm(...)
predz<-predict(hotmod, ..., se.fit=TRUE)
Then you should be able to find the estimated standard errors using:
predz$se.fit
Now if you want to do it by hand in R, it should not be as hard as you suggest:
covmat<-vcov(hotmod)
coeffs<-coef(hotmod)
Then the fitted value at a given x is coeffs[1] + coeffs[2] * x, and its standard error comes from the design vector x0 <- c(1, x) (matching the intercept and slope), not from the coefficients themselves:
sqrt(t(x0) %*% covmat %*% x0)
The %*% operator performs matrix multiplication in R.
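As a quick check that the two routes agree, here is a minimal sketch with made-up Poisson data (the data frame dat and its columns are assumptions for illustration):
set.seed(1)
dat <- data.frame(x = 1:10)
dat$counts <- rpois(10, lambda = exp(0.2 + 0.1 * dat$x))
hotmod <- glm(counts ~ x, family = poisson(link = "log"), data = dat)
# route 1: let predict() compute the standard error of the linear predictor at x = 5
predz <- predict(hotmod, newdata = data.frame(x = 5), type = "link", se.fit = TRUE)
predz$se.fit
# route 2: by hand, from the coefficient covariance matrix
covmat <- vcov(hotmod)
x0 <- c(1, 5)                    # design vector for a + b*x at x = 5
sqrt(t(x0) %*% covmat %*% x0)    # matches predz$se.fit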

Related

Least Squares fit of model - R

The data file (X in the code below) contains monthly observations X[t] over a twenty-year period.
The data can be modelled by X[12j+i] = Mu + s[i] + Y[12j+i], where i = 1,...,12 and j = 0,...,k-1, Mu, s[1],...,s[12] are parameters of the model, Y[t] is white noise WN(0, sigma^2), and k = 20. Given that the least-squares estimators of Mu and Mu + s[i] are the overall mean and the mean of all observations recorded in the ith period respectively, obtain a least-squares fit of this model to the data.
X <- c(20.73,20.51,21.04,21.34,21.60,21.67,21.93,22.18,21.55,21.38,20.78,20.75,20.57,20.09,20.61,21.33,21.72,21.83,21.70,22.62,21.40,21.53,20.71,20.82,20.73,20.65,20.67,21.40,21.21,21.63,21.77,22.20,21.29,21.44,21.01,20.75,20.64,20.24,21.03,21.61,21.46,21.61,22.08,22.66,21.21,20.95,20.88,20.37,20.53,20.30,21.26,21.14,21.99,21.88,22.46,22.31,21.65,21.60,20.62,20.71,20.64,20.94,20.89,21.19,21.57,21.91,21.71,21.95,21.52,21.06,20.77,20.50,20.67,20.77,21.06,21.70,20.73,21.83,21.71,22.99,21.81,20.97,20.72,20.43,20.49,20.33,20.95,21.34,21.61,21.88,22.28,22.33,21.16,21.00,21.07,20.59,20.87,20.59,21.06,21.23,21.59,21.80,21.76,22.48,21.91,20.96,20.83,20.86,20.36,20.48,20.89,21.35,21.80,21.87,22.13,22.54,21.91,21.33,21.18,20.67,20.98,20.77,21.22,21.09,21.37,21.71,22.45,22.25,21.70,21.67,20.59,21.12,20.35,20.86,20.87,21.29,21.96,21.85,21.90,22.10,21.64,21.56,20.46,20.43,20.87,20.38,21.05,20.78,21.99,21.59,22.29,22.23,21.70,21.12,20.69,20.47,20.42,20.51,21.10,21.39,21.98,21.86,22.40,22.04,21.69,21.32,20.74,20.51,20.21,20.29,20.64,21.29,22.03,21.90,22.22,22.07,21.95,21.57,21.01,20.27,20.97,20.64,20.95,21.19,22.02,21.73,22.35,22.45,21.50,21.15,21.04,20.28,20.27,20.48,20.83,21.78,22.11,22.31,21.80,22.52,21.41,21.13,20.61,20.81,20.82,20.42,21.20,21.19,21.39,22.33,21.91,22.36,21.53,21.53,21.12,20.37,21.01,20.23,20.71,21.17,21.63,21.89,22.34,22.23,21.45,21.32,21.05,20.90,20.80,20.69,20.49,21.28,21.68,21.98,21.67,22.63,21.77,21.36,20.61,20.83)
I found the least-squares estimators of Mu and (Mu + s[i]):
lse.Mu <- mean(X)                       # overall mean estimates Mu
IndicatorVar <- rep(1:12, 20)           # month index i for each observation
lse.Mu.si <- numeric(12)
for(i in 1:12){ lse.Mu.si[i] <- mean(X[IndicatorVar == i]) }   # monthly means estimate Mu + s[i]
This is where I get confused. I'm not sure what to do next to find the least squares fit of the model. I tried finding the estimator of Y:
est.Y <- c(1:240)
for(i in 1:12){for(j in 0:19){est.Y[(12*j)+i] <- X[(12*j)+i] - lse.Mu.si[i]}}
But still I don't know how to use it to find least squares fit or where the Z[t] white noise comes into it.
Can you please help point me in the right direction or let me know what code to use? I've spent three days on Google and I still can't work it out!
Following on from this, I need to examine the validity of the model by making a graphical comparison of the data with the model, and to employ any statistical tests that are considered appropriate. Any suggestions on which graphs and statistical tests would be best to use would be greatly appreciated.
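In case it helps, here is a minimal sketch of one way to form the fitted values and compare them with the data (the residual checks below are only suggestions, not the prescribed method):
fitted.X <- rep(lse.Mu.si, 20)    # fitted value Mu + s[i], in the same order as X
est.Y <- X - fitted.X             # estimated white-noise component Y[t]
# graphical comparison of data and fitted seasonal means
plot(X, type = "l", xlab = "t", ylab = "X[t]")
lines(fitted.X, col = "red")
# residual checks for the white-noise assumption
acf(est.Y)                                        # autocorrelations should be near zero
qqnorm(est.Y); qqline(est.Y)                      # rough normality check
Box.test(est.Y, lag = 12, type = "Ljung-Box")     # portmanteau test for autocorrelation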

Get degrees of freedom for a Standardized T Distribution with MLE

First of all, I thank you all beforehand for reading this.
I am trying to fit a standardized Student's t distribution (i.e. a t distribution scaled to have standard deviation 1) to a series of data; that is, I want to estimate the degrees of freedom via maximum likelihood estimation.
An example of what I need to achieve can be found in the following (simple) Excel file I made:
https://www.dropbox.com/s/6wv6egzurxh4zap/Excel%20Implementation%20Example.xlsx?dl=0
Inside the Excel file, I have an image that contains the formula corresponding to the calculation of the loglikelihood function for the Standardized T Student Distribution. The formula was extracted from a Finance book (Elements of Financial Risk Management - by Peter Christoffersen).
So far, I have tried this with R:
copula.data <- read.csv(file.choose(),header = TRUE)
z1 <- copula.data[,1]
library(fitdistrplus)
ft1 <- fitdist(z1, "t", method = "mle", start = list(df = 10))
df1=ft1$estimate[1]
df1
logLik(ft1)
df1 yields the number: 13.11855278779897
logLik(ft1) yields the number: -3600.2918050056487
However, the Excel file yields degrees of freedom of: 8.2962365022727, and a log-likelihood of: -3588.8879 (which is the right answer).
Note: the .csv file that my code reads is the following:
https://www.dropbox.com/s/nnh2jgq4fl6cm12/Data%20for%20T%20Copula.csv?dl=0
Any ideas? Thank you people!
The formula from your spreadsheet (with n, x substituted for the df parameter and the data)
=GAMMALN((n+1)/2)-GAMMALN(n/2)-LN(PI())/2-LN(n-2)/2-1/2*(1+n)*LN(1+x^2/(n-2))
or, exponentiating,
Gamma((n+1)/2) / (sqrt((n-2) pi) Gamma(n/2)) (1+x^2/(n-2))^-((n+1)/2)
?dt gives
f(x) = Gamma((n+1)/2) / (sqrt(n pi) Gamma(n/2)) (1 + x^2/n)^-((n+1)/2)
So the difference lies in those n-2 values in two places in the formula. That is exactly the standardization you describe: a plain t with n degrees of freedom has variance n/(n-2), and rescaling it to unit variance replaces n by n-2 in both places, giving the "standardized" t density the book uses.
Looking at the negative log-likelihood curve directly, it certainly seems as though the fitdistrplus answer is agreeing with the direct calculation. (It would be very surprising if there were a bug in the dt() function, R's distribution functions are very broadly used and thoroughly tested.)
LL <- function(p,data=z1) {
-sum(dt(data,df=p,log=TRUE))
}
pvec <- seq(6,20,by=0.05)
Lvec <- sapply(pvec,LL)
par(las=1,bty="l")
plot(pvec,Lvec,type="l",
xlab="df parameter",ylab="negative log-likelihood")
## superimpose fitdistr results ...
abline(v=coef(ft1),lty=2)
abline(h=-logLik(ft1),lty=2)
Unless there's something else you're not telling us about the problem definition, it seems to me that R is getting the right answer. (The mean and sd of the data you gave were not exactly equal to 0 and 1 respectively, but they were close; centering and scaling gave an even larger value for the parameter.)
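For completeness, here is a minimal sketch (an illustration added here, not part of the original answer) of maximizing the book's standardized-t log-likelihood directly in R, using the density with the n-2 scaling quoted above; with z1 holding the same data, this should land near the Excel values reported in the question (df ≈ 8.30, log-likelihood ≈ -3588.89).
# log-density of the standardized t (unit variance), matching the spreadsheet formula
dstd_t <- function(x, n) {
  lgamma((n + 1) / 2) - lgamma(n / 2) - 0.5 * log(pi * (n - 2)) -
    ((n + 1) / 2) * log(1 + x^2 / (n - 2))
}
nLL_std <- function(n, data = z1) -sum(dstd_t(data, n))   # negative log-likelihood
opt <- optimize(nLL_std, interval = c(2.01, 100))         # df must exceed 2
opt$minimum      # estimated degrees of freedom
-opt$objective   # maximized log-likelihood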

"Non conformable arguments" error with pgmm (plm library)

I am unsuccessfully trying to do the Arellano and Bond (1991) estimation using pgmm from the plm package. To see whether the problem was in my data, I instead used the data supplied in the plm package, but got the same problem when using the summary command:
Error in t(y) %*% x : non-conformable arguments
The coefficients of the model can be obtained though.
My own data has T=3, N=290. As I understand it, T=3 is the minimum, but it should be sufficient. When using the Arellano and Bond data, I get the same error when T=4.
data("EmplUK", package = "plm")
library(sqldf)
UK<-sqldf("select * from EmplUK where year in ('1982','1981',
'1980','1979')")
z1 <- pgmm(log(emp) ~ lag(log(emp), 1) + log(wage) +
log(capital) + log(output) | lag(log(emp), 2),
data = UK, effect = "twoways", model = "twosteps")
summary(z1)
The way I understand the estimation method and the R formula, the left-hand term is the difference of the dependent variable, and the first right-hand term is its lagged difference. The latter term is instrumented by the level of the dependent variable at (t-2).
I have verified that the subset I use is a balanced panel with T=4. When I include more years, everything works out, so it must be the length of the panel that causes trouble.
Any help would be much appreciated.
A similar question is asked here. It is suggested that the error has to do with mtest, a serial-correlation test performed by the pgmm summary method. Running the function separately seems to confirm this:
>mtest(z1, order = 2)
Error in t(y) %*% x : non-conformable arguments
T=3 is enough to estimate the model, but that only leaves you with an estimate for the last period. A second-order mtest requires the residuals to span at least 3 periods, i.e. T=5 for your model.
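For what it's worth, a quick sketch (based on your own observation that the error disappears with more years, nothing new) is to fit the same specification on the full EmplUK panel, where each firm is observed for enough periods that both summary() and the second-order test can be computed:
library(plm)
data("EmplUK", package = "plm")
z_full <- pgmm(log(emp) ~ lag(log(emp), 1) + log(wage) +
                 log(capital) + log(output) | lag(log(emp), 2),
               data = EmplUK, effect = "twoways", model = "twosteps")
summary(z_full)            # now includes the m1/m2 serial-correlation tests
mtest(z_full, order = 2)   # enough residual periods for the second-order test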

Using glm in R to solve simple equation

I have some data from a Poisson distribution and a simple equation I want to solve using glm.
The mathematical equation is observed = y * expected.
I have the observed and expected data and want to use glm to find the optimal value of y by which I need to multiply expected to get observed. I also want confidence intervals for y.
Should I be doing something like this
glm(observed ~ expected + offset(log(expected)) + 0, family = 'poisson', data = dataDF)
Then taking the exponential of the coefficient? I tried this, but the value given is pretty different from what I get when I divide the sum of the observed by the sum of the expected, and I thought these should be similar.
Am I doing something wrong?
Thanks
Try this:
logFac <- coef( glm(observed ~ offset(log(expected)), family = 'poisson', data = dataDF))
Fac <- exp( logFac[1] ) # that's the intercept term
That model is really: observed ~ 1 + offset(log(expected)), and since it is estimated on the log scale, exponentiating the intercept gives the conversion factor between 'expected' and 'observed'. In fact, for this intercept-only model the MLE of exp(intercept) is exactly sum(observed)/sum(expected), so it agrees with the ratio you computed. The negative comments are evidence that you should have posted on CrossValidated.com, where general statistics-methods questions are more welcome.
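A small simulated check of that claim (the column names observed and expected follow your data frame; the true factor of 1.3 is made up):
set.seed(42)
dataDF <- data.frame(expected = runif(100, 5, 50))
dataDF$observed <- rpois(100, lambda = 1.3 * dataDF$expected)   # true y = 1.3
fit <- glm(observed ~ offset(log(expected)), family = poisson, data = dataDF)
exp(coef(fit))                                 # estimate of y
sum(dataDF$observed) / sum(dataDF$expected)    # identical, up to convergence tolerance
exp(confint(fit))                              # confidence interval for y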

OHLC Time Series - Anomaly Detection with Multivariate Gaussian Distribution

I have an OHLC time series of stock prices:
library(quantmod)
library(mnormt)
library(MASS)
download.file("http://dl.dropbox.com/u/25747565/941.RData", destfile="test.RData")
load("test.RData")
chartSeries(p)
As you can see from the plot, there are two downward spikes, most likely due to some sort of data error. I want to use a multivariate Gaussian to detect the rows that contain these two offending data points.
> x[122,]
941.Open 941.High 941.Low 941.Close
85.60 86.65 5.36 86.20
> x[136,]
941.Open 941.High 941.Low 941.Close
84.15 85.60 54.20 85.45
Here is my code to fit the distribution and calculate the probabilities of each data point:
x <- coredata(p[,1:4])
mu <- apply(x, 2, mean)
sigma <- cov.rob(x)$cov
prob <- apply(x, 1, dmnorm, mean = mu, varcov = sigma, log = TRUE)
However, this code throws up the following error:
Error in pd.solve(varcov, log.det = TRUE) : x appears to be not symmetric
This error did not come up when I used the standard cov() function to calculate the covariance matrix, only with the robust covariance function. The covariance matrix itself looks quite benign to me, so I'm not sure what is going on. The reason I want a robust estimate of the covariance matrix is that the standard covariance matrix was giving a few false positives, since I was including anomalies in my training set.
Can someone tell me:
a) how to fix this
b) if this approach even makes sense
Thank you!
P.S. I considered posting this on Cross Validated but thought SO was more appropriate as this seems like a "programming" issue.
You are going in the right direction. I don't know how to fix the error, but I can suggest how to find the indices of the novel points.
Your training data should contain only a small number of anomalies, otherwise the cross-validation and test results will suffer.
Calculate the mean and covariance (shown here in MATLAB) as
mu = mean(traindata);      % per-feature means of the training data
sigma = cov(traindata);    % covariance matrix of the training data
Set epsilon (the density threshold) as per your requirement.
count = 0;
Actual = [];                           % indices of points flagged as anomalies
for j = 1:size(cv, 1)                  % cv is the cross-validation set
    p = mvnpdf(cv(j, :), mu, sigma);   % density of the j-th point under the fitted Gaussian
    if p < epsilon
        count = count + 1;
        Actual = [Actual; j];
        fprintf('j=%d \t p=%e\n', j, p);
    end
end
Tune the threshold epsilon if the results are not satisfactory.
Evaluate the model using the F1 score (if the F1 score is 1, you did it right).
Apply the model to the test data.
done!
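Since the rest of this thread is in R, here is a rough R translation of the same thresholding idea. The symmetrization line is only a guess at the pd.solve() complaint (cov.rob() can return a matrix that is not exactly symmetric at floating-point precision); it has not been confirmed against your data:
library(quantmod); library(MASS); library(mnormt)
x <- coredata(p[, 1:4])                 # p is the OHLC series loaded in the question
mu <- colMeans(x)
sigma <- cov.rob(x)$cov
sigma <- (sigma + t(sigma)) / 2         # force exact symmetry before calling dmnorm()
logp <- apply(x, 1, dmnorm, mean = mu, varcov = sigma, log = TRUE)
epsilon <- quantile(logp, 0.01)         # crude threshold on the log-density; tune as needed
which(logp < epsilon)                   # candidate anomalous rows (ideally 122 and 136)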
