The data file (X in the code below) contains monthly data X[t] recorded over a twenty-year period.
The data can be modelled as X[12j+i] = Mu + s[i] + Y[12j+i], for i = 1,...,12 and j = 0,...,k-1, where Mu, s[1],...,s[12] are parameters of the model, Y[t] is white noise WN(0, sigma^2), and k = 20. The least-squares estimators of Mu and Mu+s[i] are given as the overall mean and the mean of all observations recorded in the ith period, respectively. The task is to obtain a least-squares fit of this model to the data.
X <- c(20.73,20.51,21.04,21.34,21.60,21.67,21.93,22.18,21.55,21.38,20.78,20.75,20.57,20.09,20.61,21.33,21.72,21.83,21.70,22.62,21.40,21.53,20.71,20.82,20.73,20.65,20.67,21.40,21.21,21.63,21.77,22.20,21.29,21.44,21.01,20.75,20.64,20.24,21.03,21.61,21.46,21.61,22.08,22.66,21.21,20.95,20.88,20.37,20.53,20.30,21.26,21.14,21.99,21.88,22.46,22.31,21.65,21.60,20.62,20.71,20.64,20.94,20.89,21.19,21.57,21.91,21.71,21.95,21.52,21.06,20.77,20.50,20.67,20.77,21.06,21.70,20.73,21.83,21.71,22.99,21.81,20.97,20.72,20.43,20.49,20.33,20.95,21.34,21.61,21.88,22.28,22.33,21.16,21.00,21.07,20.59,20.87,20.59,21.06,21.23,21.59,21.80,21.76,22.48,21.91,20.96,20.83,20.86,20.36,20.48,20.89,21.35,21.80,21.87,22.13,22.54,21.91,21.33,21.18,20.67,20.98,20.77,21.22,21.09,21.37,21.71,22.45,22.25,21.70,21.67,20.59,21.12,20.35,20.86,20.87,21.29,21.96,21.85,21.90,22.10,21.64,21.56,20.46,20.43,20.87,20.38,21.05,20.78,21.99,21.59,22.29,22.23,21.70,21.12,20.69,20.47,20.42,20.51,21.10,21.39,21.98,21.86,22.40,22.04,21.69,21.32,20.74,20.51,20.21,20.29,20.64,21.29,22.03,21.90,22.22,22.07,21.95,21.57,21.01,20.27,20.97,20.64,20.95,21.19,22.02,21.73,22.35,22.45,21.50,21.15,21.04,20.28,20.27,20.48,20.83,21.78,22.11,22.31,21.80,22.52,21.41,21.13,20.61,20.81,20.82,20.42,21.20,21.19,21.39,22.33,21.91,22.36,21.53,21.53,21.12,20.37,21.01,20.23,20.71,21.17,21.63,21.89,22.34,22.23,21.45,21.32,21.05,20.90,20.80,20.69,20.49,21.28,21.68,21.98,21.67,22.63,21.77,21.36,20.61,20.83)
I found the least-squares estimators of Mu and (Mu + s[i]):
lse.Mu <- mean(X)
IndicatorVar <- rep(1:12,20)
lse.Mu.si <- c(1:12)
for(i in 1:12){lse.Mu.si[i] <- mean(X[IndicatorVar==i])}   # mean of all observations in period i
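As a cross-check (a sketch, not part of the original code, assuming the X and IndicatorVar objects above), the same least-squares fit can be obtained with lm() and a month factor; the fitted values are then exactly the period means:
month <- factor(rep(1:12, 20))          # month indicator as a factor
fit.lm <- lm(X ~ month)                 # one-way least-squares fit
all.equal(unname(fitted(fit.lm)), lse.Mu.si[IndicatorVar])   # should be TRUE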
This is where I get confused. I'm not sure what to do next to find the least squares fit of the model. I tried finding the estimator of Y:
est.Y <- c(1:240)
for(i in 1:12){for(j in 0:19){est.Y[(12*j)+i] <- X[(12*j)+i] - lse.Mu.si[i]}}
But I still don't know how to use it to obtain the least-squares fit, or where the Y[t] white noise comes into it.
Can you please help point me in the right direction or let me know what code to use? I've spent three days on Google and I still can't work it out!
Following on from this, I need to examine the validity of the model by making a graphical comparison of the data with the model and employing any statistical tests that are considered appropriate. Any suggestions on which graphs and statistical tests would be best would be greatly appreciated.
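One possible approach (a sketch, assuming the lse.Mu.si and IndicatorVar objects above): overlay the fitted seasonal means on the data, then check whether the residuals look like white noise, e.g. with the sample ACF and a Ljung-Box test:
fitted.X <- lse.Mu.si[IndicatorVar]              # Mu-hat + s-hat[i] for each observation
plot(X, type = "l", xlab = "month index", ylab = "X[t]")
lines(fitted.X, col = "red")                     # fitted seasonal pattern
resid.Y <- X - fitted.X                          # estimated noise series Y[t]
acf(resid.Y)                                     # residual autocorrelation
Box.test(resid.Y, lag = 12, type = "Ljung-Box")  # formal test for white noise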
First of all, I thank you all beforehand for reading this.
I am trying to fit a standardized Student's t distribution (i.e. a Student's t with standard deviation = 1) to a series of data; that is, I want to estimate the degrees of freedom via maximum likelihood estimation.
An example of what I need to achieve can be found in the following (simple) Excel file I made:
https://www.dropbox.com/s/6wv6egzurxh4zap/Excel%20Implementation%20Example.xlsx?dl=0
Inside the Excel file, I have an image that contains the formula for the log-likelihood function of the standardized Student's t distribution. The formula was taken from a finance book (Elements of Financial Risk Management, by Peter Christoffersen).
So far, I have tried this with R:
copula.data <- read.csv(file.choose(),header = TRUE)
z1 <- copula.data[,1]
library(fitdistrplus)
ft1 <- fitdist(z1, "t", method = "mle", start = list(df = 10))
df1=ft1$estimate[1]
df1
logLik(ft1)
df1 yields the number: 13.11855278779897
logLik(ft1) yields the number: -3600.2918050056487
However, the Excel file yields degrees of freedom of: 8.2962365022727, and a log-likelihood of: -3588.8879 (which is the right answer).
Note: the .csv file that my code reads is the following:
https://www.dropbox.com/s/nnh2jgq4fl6cm12/Data%20for%20T%20Copula.csv?dl=0
Any ideas? Thank you people!
The formula from your spreadsheet (with n and x substituted for the df parameter and the data) is
=GAMMALN((n+1)/2)-GAMMALN(n/2)-LN(PI())/2-LN(n-2)/2-1/2*(1+n)*LN(1+x^2/(n-2))
or, exponentiating,
Gamma((n+1)/2) / (sqrt((n-2) pi) Gamma(n/2)) (1+x^2/(n-2))^-((n+1)/2)
?dt gives
f(x) = Gamma((n+1)/2) / (sqrt(n pi) Gamma(n/2)) (1 + x^2/n)^-((n+1)/2)
So the difference lies in those n-2 values in two places in the formula. I don't have enough context to see why the author is defining the t distribution in that different way; there may be some good reason ...
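One possibility is that the n-2 terms rescale the density to have unit variance, which would match the "standardized" t described in the question. In any case, the spreadsheet's log-likelihood can be maximized directly; a sketch, assuming z1 holds the same data as above:
nll_std_t <- function(nu, x = z1) {              # negative log-likelihood of the n-2 form
  -sum(lgamma((nu + 1)/2) - lgamma(nu/2) - log(pi)/2 - log(nu - 2)/2 -
         (nu + 1)/2 * log(1 + x^2/(nu - 2)))
}
fit_std <- optimize(nll_std_t, interval = c(2.1, 50))
fit_std$minimum                                  # df estimate under the n-2 parameterization
-fit_std$objective                               # corresponding maximized log-likelihood
If the spreadsheet implements the same formula, this should come out close to its reported estimate.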
Looking at the negative log-likelihood curve directly, it certainly seems that the fitdistrplus answer agrees with the direct calculation. (It would be very surprising if there were a bug in the dt() function; R's distribution functions are very widely used and thoroughly tested.)
## negative log-likelihood of the df parameter, using R's dt()
LL <- function(p,data=z1) {
  -sum(dt(data,df=p,log=TRUE))
}
pvec <- seq(6,20,by=0.05)
Lvec <- sapply(pvec,LL)
par(las=1,bty="l")
plot(pvec,Lvec,type="l",
xlab="df parameter",ylab="negative log-likelihood")
## superimpose fitdistr results ...
abline(v=coef(ft1),lty=2)
abline(h=-logLik(ft1),lty=2)
Unless there's something else you're not telling us about the problem definition, it seems to me that R is getting the right answer. (The mean and sd of the data you gave were not exactly equal to 0 and 1 respectively, but they were close; centering and scaling gave an even larger value for the parameter.)
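For reference, the centering and scaling mentioned in the parenthesis might be done like this (just a sketch, an assumption about the exact steps):
z1s <- as.numeric(scale(z1))                     # centre to mean 0, scale to sd 1
ft2 <- fitdist(z1s, "t", method = "mle", start = list(df = 10))
coef(ft2)                                        # df estimate for the rescaled data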
I am unsuccessfully trying to do the Arellano and Bond (1991) estimation using pgmm from the plm package. To see if the problem was in my data, I instead used the data supplied in the plm library, but got the same problem when using the "summary" command:
Error in t(y) %*% x : non-conformable arguments
The coefficients of the model can be obtained though.
My own data has T=3, N=290. As I understand it, T=3 is the minimum, but it should be sufficient. When using the Arellano and Bond data, I get the same error when T=4.
data("EmplUK", package = "plm")
library(sqldf)
UK<-sqldf("select * from EmplUK where year in ('1982','1981',
'1980','1979')")
z1 <- pgmm(log(emp) ~ lag(log(emp), 1) + log(wage) +
log(capital) + log(output) | lag(log(emp), 2),
data = UK, effect = "twoways", model = "twosteps")
summary(z1)
The way I understand the estimation method and the R formula, the left-hand term is the difference of the dependent variable, the first right-hand term is its lagged difference, and the latter is instrumented by the level of the dependent variable at (t-2).
I have verified that the subset I use is a balanced panel with T=4. When I include more years, everything works out, so it must be the length of the panel that causes trouble.
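One way to check the panel dimensions (a sketch, assuming plm is attached):
library(plm)
pdim(pdata.frame(UK, index = c("firm", "year")))   # reports n, T and whether the panel is balanced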
Any help would be much appreciated.
A similar question is asked here. It is suggested that the error has to do with mtest, a serial-correlation test performed by the pgmm summary method. Running the function separately seems to confirm this:
>mtest(z1, order = 2)
Error in t(y) %*% x : non-conformable arguments
T=3 is enough to estimate the model, but this only leaves you with an estimate for the last period. A second-order mtest requires the residuals to contain at least 3 periods, i.e. T=5 for your model.
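To illustrate (a sketch, not from the original answer): on the full EmplUK panel, where firms are observed for more periods, the same specification should leave enough residual periods for the second-order test:
library(plm)
data("EmplUK", package = "plm")
z_full <- pgmm(log(emp) ~ lag(log(emp), 1) + log(wage) +
                 log(capital) + log(output) | lag(log(emp), 2),
               data = EmplUK, effect = "twoways", model = "twosteps")
summary(z_full)                                  # includes the m1/m2 serial-correlation tests
mtest(z_full, order = 2)                         # should now be computable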
I have an OHLC time series for some stock prices:
library(quantmod)
library(mnormt)
library(MASS)
download.file("http://dl.dropbox.com/u/25747565/941.RData", destfile="test.RData")
load("test.RData")
chartSeries(p)
As you can see from the plot, there are two downward spikes, most likely due to some sort of data error. I want to use a multivariate Gaussian to detect the rows which contain these two offending data points.
> x[122,]
941.Open 941.High 941.Low 941.Close
85.60 86.65 5.36 86.20
> x[136,]
941.Open 941.High 941.Low 941.Close
84.15 85.60 54.20 85.45
Here is my code to fit the distribution and calculate the probabilities of each data point:
x <- coredata(p[,1:4])        # OHLC matrix from the xts object
mu <- apply(x, 2, mean)       # column means
sigma <- cov.rob(x)$cov       # robust covariance estimate (MASS)
prob <- apply(x, 1, dmnorm, mean = mu, varcov = sigma, log = TRUE)   # row-wise log-densities
However, this code throws up the following error:
Error in pd.solve(varcov, log.det = TRUE) : x appears to be not symmetric
This error did not come up when I used the standard cov() function to calculate the covariance matrix, only with the robust covariance function cov.rob(). The covariance matrix itself looks quite benign to me, so I'm not sure what is going on. The reason I want a robust estimate of the covariance matrix is that the standard covariance matrix was giving a few false positives, since I was including the anomalies in my training set.
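(For what it's worth, one cheap thing to try, although it is only a guess and not a confirmed fix: pd.solve() may be rejecting the cov.rob() matrix because of tiny floating-point asymmetry, so checking and forcing exact symmetry costs nothing.)
isSymmetric(sigma)                               # FALSE would be consistent with the error
sigma.sym <- (sigma + t(sigma)) / 2              # force exact symmetry (a guess, not a confirmed fix)
prob <- apply(x, 1, dmnorm, mean = mu, varcov = sigma.sym, log = TRUE)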
Can someone tell me:
a) how to fix this
b) if this approach even makes sense
Thank you!
P.S. I considered posting this on Cross Validated but thought SO was more appropriate as this seems like a "programming" issue.
You are on the right track. I don't know how to fix the error, but I can suggest how to find the indices of the novel (anomalous) points.
Your training data should contain only a small number of anomalies; otherwise the cross-validation and test results will suffer.
Calculate the mean and covariance in MATLAB:
mu=mean(traindata);
sigma=cov(traindata);
Set epsilon according to your requirements, then loop over the cross-validation set:
count=0;
Actual=[];
for j=1:size(cv,1)
    p=mvnpdf(cv(j,:),mu,sigma);     % density of the j-th cross-validation row
    if p<epsilon                    % flag low-density rows as anomalies
        count=count+1;
        Actual=[Actual;j];          % store the index of the flagged row
        fprintf('j=%d \t p=%e\n',j,p);
    end
end
Tune the threshold if the results are not satisfactory.
Evaluate the model using the F1 score (if the F1 score is 1, you did it right).
Apply the model to the test data.
Done!
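Since the question uses R, a rough R equivalent of the above procedure might look like this (a sketch reusing x from the question; the threshold epsilon is hypothetical and would need tuning):
library(mnormt)
mu      <- colMeans(x)                           # feature means
sigma   <- cov(x)                                # covariance (or a robust estimate)
p       <- apply(x, 1, dmnorm, mean = mu, varcov = sigma)   # density of each row
epsilon <- quantile(p, 0.01)                     # hypothetical threshold; tune to your data
which(p < epsilon)                               # candidate anomalous rows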