OHLC Time Series - Anomaly Detection with Multivariate Gaussian Distribution - r

I have an OHLC time series for some stock prices:
library(quantmod)
library(mnormt)
library(MASS)
download.file("http://dl.dropbox.com/u/25747565/941.RData", destfile="test.RData")
load("test.RData")
chartSeries(p)
As you can see from the plot, there are two downward spikes, most likely due to some sort of data error. I want to use a multivariate Gaussian to detect the rows which contain these two offending data points.
> x[122,]
941.Open 941.High 941.Low 941.Close
85.60 86.65 5.36 86.20
> x[136,]
941.Open 941.High 941.Low 941.Close
84.15 85.60 54.20 85.45
Here is my code to fit the distribution and calculate the probabilities of each data point:
x <- coredata(p[,1:4])
mu <- apply(x, 2, mean)
sigma <- cov.rob(x)$cov
prob <- apply(x, 1, dmnorm, mean = mu, varcov = sigma, log = TRUE)
However, this code throws up the following error:
Error in pd.solve(varcov, log.det = TRUE) : x appears to be not symmetric
This error did not come up when I used the standard cov() function to calculate the covariance matrix, only with the robust covariance function cov.rob(). The covariance matrix itself looks quite benign to me, so I'm not sure what is going on. The reason I want a robust estimate of the covariance matrix is that the standard covariance matrix gave a few false positives, because the anomalies were included in my training set.
Can someone tell me:
a) how to fix this
b) if this approach even makes sense
Thank you!
P.S. I considered posting this on Cross Validated but thought SO was more appropriate as this seems like a "programming" issue.

You are on the right track. I don't know how to fix the error, but I can suggest how to find the indices of the anomalous points.
Your training data should contain only a small number of anomalies; otherwise the cross-validation and test results will suffer.
Calculate the mean and covariance in MATLAB as
mu=mean(traindata);
sigma=cov(traindata);
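Set epsilon (the density threshold) as your application requires, then flag every cross-validation row whose density falls below it:
count=0;
Actual=[];   % indices of the flagged rows
for j=1:size(cv,1)
    p=mvnpdf(cv(j,:),mu,sigma);
    if p<epsilon
        count=count+1;
        Actual=[Actual;j];
        fprintf('j=%d \t p=%e\n',j,p);
    end
end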
Tune the threshold if the results are not satisfactory.
Evaluate the model using the F1 score (an F1 score of 1 means every anomaly was flagged with no false positives).
Apply the model to the test data.
Done!
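On the error itself, one possible explanation (an assumption, not verified against the asker's data) is that the matrix returned by cov.rob() is symmetric only up to floating-point rounding, while the check inside mnormt appears to be strict. Averaging the matrix with its transpose removes any such asymmetry. The sketch below uses synthetic data in place of the OHLC matrix, takes the robust centre from the same cov.rob() fit, and scores rows with mahalanobis() from base R. The Gaussian log-density is a decreasing function of this distance, so the ranking matches what dmnorm(..., log = TRUE) would give. The 0.999 chi-square cutoff is an illustrative choice only.

```r
library(MASS)  # cov.rob()

set.seed(1)
# Synthetic stand-in for the 4-column OHLC matrix (not the asker's data)
x <- matrix(rnorm(400, mean = 85, sd = 1), ncol = 4)
x[50, 3] <- 5.36                       # plant one bad "Low" value

rob   <- cov.rob(x)                    # robust centre and covariance
mu    <- rob$center
sigma <- (rob$cov + t(rob$cov)) / 2    # force exact symmetry

# Squared Mahalanobis distance; large values mean low Gaussian density
d2 <- mahalanobis(x, center = mu, cov = sigma)
flagged <- which(d2 > qchisq(0.999, df = ncol(x)))
print(flagged)
```

The symmetrised sigma can then be passed to dmnorm() as varcov if the log-densities themselves are needed.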

Related

How to estimate a parameter in R- sample with replace

I have a txt file with numbers that looks like this (but with 100 numbers):
[1] 7.1652348 5.6665965 4.4757553 4.8497086 15.2276296 -0.5730937
[7] 4.9798067 2.7396933 5.1468304 10.1221489 9.0165661 65.7118194
[13] 5.5205704 6.3067488 8.6777177 5.2528503 3.5039562 4.2477401
[19] 11.4137624 -48.1722034 -0.3764006 5.7647536 -27.3533138 4.0968204
I need to find the maximum likelihood estimate (MLE) of the parameter theta for this distribution:
[image: the density of the distribution]
I need to estimate theta from a sample of 1000 observations drawn with replacement, save the sample, and plot a histogram.
How can I estimate theta from my sample? I cannot assume a normal distribution.
I wrote something like this -
data<-read.table(file.choose(), header = TRUE, sep= "")
B <- 1000
sample.means <- numeric(data)
sample.sd <- numeric(data)
for (i in 1:B) {
MySample <- sample(data, length(data), replace = TRUE)
sample.means <- c(sample.means,mean(MySample))
sample.sd <- c(sample.sd,sd(MySample))
}
sd(sample.sd)
but it doesn't work.
This question incorporates multiple different ones, so let's tackle each step by step.
First, you will need to draw a random sample (with replacement) from your population. Assuming your 100 population observations sit in a vector named pop,
rs <- sample(pop, 1000, replace = TRUE)
gives you your vector of random samples. If you want to save it, you can write it to disk in multiple formats, so I'll just point to a related question (How to Export/Import Vectors in R?).
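A minimal sketch of this first step, including saving the sample and drawing the histogram. The vector pop is simulated here because the original file is not available, and the RDS format and file name are just one arbitrary choice.

```r
set.seed(42)
# Hypothetical stand-in for the 100 numbers read from the txt file
pop <- rcauchy(100, location = 5)

# Draw 1000 observations with replacement
rs <- sample(pop, 1000, replace = TRUE)

# Save the sample and read it back (path and format are illustrative)
out <- file.path(tempdir(), "resample.rds")
saveRDS(rs, out)
rs_back <- readRDS(out)

# Histogram of the resampled values
hist(rs, breaks = 50, main = "Sample of 1000 drawn with replacement")
```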
In a second step, you can use the mle()-function of the stats4-package (https://stat.ethz.ch/R-manual/R-devel/library/stats4/html/mle.html) and specify the objective function explicitly.
However, the second part of your question is more of a statistical/conceptual question than R related, IMO.
Try to understand what MLE actually does. You do not need normally distributed variables. The idea behind MLE is to choose theta in such a way, that under the resulting distribution the random sample is the most probable. Check https://en.wikipedia.org/wiki/Maximum_likelihood_estimation for more details or some youtube videos, if you'd like a more intuitive approach.
I assume the description of your task states that f(x|theta) is the conditional joint density function and that the observations x are i.i.d.?
What you want to do in this case is to select theta such that the squared difference between the observations x and the parameter theta is kept small.
For your statistical understanding, in such cases it makes sense to work with the logarithm of the likelihood instead of dealing with the non-linear product directly.
Keeping the squared differences small corresponds to maximizing the log-transformed function, since the sum enters with a negative sign (<=> the product was in the denominator) and the log, as well as the +1, are monotone transformations.
This leaves you with the maximization problem:
And the first-order condition:
Obviously, you would also have to check that you are actually dealing with a maximum via the second-order condition but I'll omit that at this stage for simplicity.
The algorithm in R does nothing else than solving this maximization problem.
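The density itself is not reproduced here, but the description (a squared difference plus 1 in the denominator of a product) is consistent with a Cauchy location model, f(x|theta) = 1 / (pi * (1 + (x - theta)^2)). Under that assumption, which should be checked against the actual task, the maximization can be sketched with optimize() from base R. The sample rs is simulated for illustration.

```r
set.seed(42)
# Hypothetical sample: 1000 draws from a Cauchy with location 5
rs <- rcauchy(1000, location = 5, scale = 1)

# log L(theta) = -n * log(pi) - sum(log(1 + (x_i - theta)^2))
# Negative log-likelihood with the constant n * log(pi) dropped
negloglik <- function(theta, x) sum(log(1 + (x - theta)^2))

# Search near the median, where the likelihood is well behaved
search <- median(rs) + c(-1, 1) * IQR(rs)
fit <- optimize(negloglik, interval = search, x = rs)
theta_hat <- fit$minimum

# First-order condition: sum(2 * (x_i - theta) / (1 + (x_i - theta)^2)) = 0
score <- sum(2 * (rs - theta_hat) / (1 + (rs - theta_hat)^2))
print(c(theta_hat = theta_hat, score = score))
```

The search is restricted to a neighbourhood of the median because a Cauchy likelihood can have several local maxima far out in the tails.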
Hope this helps for your understanding. Maybe some smarter people can give some additional input.

Least Squares fit of model - R

The data file (X in code thread below) contains the record of monthly data X[t] over a twenty year period.
The data can be modelled by X[12j+i] = Mu + s[i] + Y[12j+i] (i=1,...,12; j=1,...,k), where Mu, s[1],...,s[12] are parameters of the model, Z[t] is white noise WN(0,sigma^2) and k=20. We are given that the least-squares estimators of Mu and Mu+s[i] are the overall mean and the mean of all observations recorded in the ith period, respectively. Obtain a least-squares fit of this model to the data.
X <- c(20.73,20.51,21.04,21.34,21.60,21.67,21.93,22.18,21.55,21.38,20.78,20.75,20.57,20.09,20.61,21.33,21.72,21.83,21.70,22.62,21.40,21.53,20.71,20.82,20.73,20.65,20.67,21.40,21.21,21.63,21.77,22.20,21.29,21.44,21.01,20.75,20.64,20.24,21.03,21.61,21.46,21.61,22.08,22.66,21.21,20.95,20.88,20.37,20.53,20.30,21.26,21.14,21.99,21.88,22.46,22.31,21.65,21.60,20.62,20.71,20.64,20.94,20.89,21.19,21.57,21.91,21.71,21.95,21.52,21.06,20.77,20.50,20.67,20.77,21.06,21.70,20.73,21.83,21.71,22.99,21.81,20.97,20.72,20.43,20.49,20.33,20.95,21.34,21.61,21.88,22.28,22.33,21.16,21.00,21.07,20.59,20.87,20.59,21.06,21.23,21.59,21.80,21.76,22.48,21.91,20.96,20.83,20.86,20.36,20.48,20.89,21.35,21.80,21.87,22.13,22.54,21.91,21.33,21.18,20.67,20.98,20.77,21.22,21.09,21.37,21.71,22.45,22.25,21.70,21.67,20.59,21.12,20.35,20.86,20.87,21.29,21.96,21.85,21.90,22.10,21.64,21.56,20.46,20.43,20.87,20.38,21.05,20.78,21.99,21.59,22.29,22.23,21.70,21.12,20.69,20.47,20.42,20.51,21.10,21.39,21.98,21.86,22.40,22.04,21.69,21.32,20.74,20.51,20.21,20.29,20.64,21.29,22.03,21.90,22.22,22.07,21.95,21.57,21.01,20.27,20.97,20.64,20.95,21.19,22.02,21.73,22.35,22.45,21.50,21.15,21.04,20.28,20.27,20.48,20.83,21.78,22.11,22.31,21.80,22.52,21.41,21.13,20.61,20.81,20.82,20.42,21.20,21.19,21.39,22.33,21.91,22.36,21.53,21.53,21.12,20.37,21.01,20.23,20.71,21.17,21.63,21.89,22.34,22.23,21.45,21.32,21.05,20.90,20.80,20.69,20.49,21.28,21.68,21.98,21.67,22.63,21.77,21.36,20.61,20.83)
I found the least squares estimator of Mu and (Mu+s[i])
lse.Mu <- mean(X)
IndicatorVar <- rep(1:12,20)
lse.Mu.si <- c(1:12)
for(i in 1:12){lse.Mu.si[i] <- mean(X[IndicatorVar==i])}
This is where I get confused. I'm not sure what to do next to find the least squares fit of the model. I tried finding the estimator of Y:
est.Y <- c(1:240)
for(i in 1:12){for(j in 0:19){est.Y[(12*j)+i] <- X[(12*j)+i] - lse.Mu.si[i]}}
But still I don't know how to use it to find least squares fit or where the Z[t] white noise comes into it.
Can you please help point me in the right direction or let me know what code to use? I've spent three days on Google and I still can't work it out!
Following on from this, I need to examine the validity of the model by making a graphical comparison of the data with the model and employ any statistical tests that are considered appropriate. Any suggestions on which graphs and statistical tests would be best to use would be greatly appreciated.
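One way to read the task is that the least-squares fit is simply the fitted value Mu + s[i] for each month, and the residuals play the role of the noise term. The sketch below illustrates this on simulated data (the seasonal pattern and noise level are made up), checks that lm() with a month factor gives the same fitted values, and runs a few common residual checks. The choice of tests is a suggestion, not part of the original task.

```r
set.seed(1)
k <- 20
month <- rep(1:12, k)
s_true <- sin(2 * pi * (1:12) / 12)            # made-up seasonal effects
X <- 21 + s_true[month] + rnorm(12 * k, sd = 0.3)

mu_hat    <- mean(X)                           # estimate of Mu
mu_si_hat <- tapply(X, month, mean)            # estimates of Mu + s[i]
s_hat     <- mu_si_hat - mu_hat                # seasonal effects
x_hat     <- as.numeric(mu_si_hat[month])      # least-squares fitted values
res       <- X - x_hat                         # estimated noise

# The same fit via lm() with month as a factor
fit <- lm(X ~ factor(month))
same_fit <- isTRUE(all.equal(unname(fitted(fit)), x_hat))

# Graphical comparison and residual checks
plot(X, type = "l"); lines(x_hat, col = 2)
acf(res)                                       # white noise: no large spikes
lb <- Box.test(res, lag = 12, type = "Ljung-Box")
sw <- shapiro.test(res)
```

With the real series, X would be the data vector from the question and the simulation lines would be dropped.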

Critical Value for Shapiro Wilk test

I am trying to get the critical W value for a Shapiro Wilk Test in R.
Shapiro-Wilk normality test
data: samplematrix[, 1]
W = 0.69661, p-value = 7.198e-09
With n=50 and alpha=.05, I know that the critical value is W=.947, by consulting the critical value table. However, how do I get this critical value using R?
Computing critical values directly is not easy (see this CrossValidated answer); what I've got here is essentially the same as what's in that answer (although I came up with it independently, and it improves on that answer slightly by using order statistics rather than random samples). The idea is that we can make a sample progressively more non-Normal until it gets exactly the desired p-value (0.05 in this case), then see what W-statistic corresponds to that sample.
## compute S-W for a given Gamma shape parameter and sample size
tmpf <- function(gshape=20,n=50) {
    shapiro.test(qgamma((1:n)/(n+1),scale=1,shape=gshape))
}
## find shape parameter that corresponds to a particular p-value
find.shape <- function(n,alpha) {
    uniroot(function(x) tmpf(x,n)$p.value-alpha,
            interval=c(0.01,100))$root
}
find.W <- function(n,alpha) {
    s <- find.shape(n,alpha)
    tmpf(s,n=n)$statistic
}
find.W(50,0.05)
The answer (0.9540175) is not quite the same as the answer you got, because R uses an approximation to the Shapiro-Wilk test. As far as I know, the actual S-W critical value tables stem entirely from Shapiro and Wilk 1965 Biometrika http://www.jstor.org/stable/2333709 p. 605, which says only "Based on fitted Johnson (1949) S_B approximation, see Shapiro and Wilk 1965a for details" - and "Shapiro and Wilk 1965a" refers to an unpublished manuscript! (S&W essentially sampled many Normal deviates, computed the SW statistic, constructed smooth approximations of the SW statistic over a range of values, and took the critical values from this distribution).
I also tried to do this by brute force, but (see below) if we want to be naive and not do curve-fitting as SW did, we will need much larger samples ...
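find.W.stoch <- function(n=50,alpha=0.05,N=200000,.progress="none") {
    ## each call returns c(W, p-value), so d is an N x 2 matrix
    d <- plyr::raply(N,.Call(stats:::C_SWilk,sort(rnorm(n))),
                     .progress=.progress)
    ## small W is evidence against normality: take the lower alpha quantile of W
    return(quantile(d[,1],alpha))
}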
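For comparison, a smaller brute-force sketch that relies only on the public shapiro.test() interface rather than the internal C routine or plyr. The number of replicates is kept low here for speed, so the result is rough, and the seed is arbitrary.

```r
set.seed(123)
n <- 50; alpha <- 0.05; N <- 2000

# Simulate W under normality
W <- replicate(N, unname(shapiro.test(rnorm(n))$statistic))

# Critical value: the lower alpha quantile of the simulated W values
w_crit <- quantile(W, probs = alpha)
print(w_crit)
```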
Compare original S&W values (transcribed from the papers) with the R approximation:
SW1965 <- c(0.767,0.748,0.762,0.788,0.803,0.818,0.829,0.842,
0.850,0.859,0.866,0.874,0.881,0.887,0.892,0.897,0.901,0.905,
0.908,0.911,0.914,0.916,0.918,0.920,0.923,0.924,0.926,0.927,
0.929,0.930,0.931,0.933,0.934,0.935,0.936,0.938,0.939,0.940,
0.941,0.942,0.943,0.944,0.945,0.945,0.946,0.947,0.947,0.947)
Rapprox <- sapply(3:50,find.W,alpha=0.05)
Rapprox.stoch <- sapply(3:50,find.W.stoch,alpha=0.05,.progress="text")
par(bty="l",las=1)
matplot(3:50,cbind(SW1965,Rapprox,Rapprox.stoch),col=c(1,2,4),
type="l",
xlab="n",ylab=~W[crit])
legend("bottomright",col=c(1,2,4),lty=1:3,
c("SW orig","R approx","stoch"))

Finding Mean Squared Error?

I have produced a linear data set and have used lm() to fit a model to that dataset. I am now trying to find the MSE using mse()
I know the formula for MSE but I'm trying to use this function. What would be the proper way to do so? I have looked at the documentation, but I'm either dumb or it's just worded for people who actually know what they're doing.
library(hydroGOF)
x.linear <- seq(0, 200, by=1) # x data
error.linear <- rnorm(n=length(x.linear), mean=0, sd=1) # Error (0, 1)
y.linear <- x.linear + error.linear # y data
training.data <- data.frame(x.linear, y.linear)
training.model <- lm(training.data)
training.mse <- mse(training.model, training.data)
plot(training.data)
mse() needs two data frames. I'm not sure how to get a data frame out of lm(). Am I even on the right track to finding a proper MSE for my data?
Try this:
mean((training.data - predict(training.model))^2)
#[1] 0.4467098
You can also use the code below, which is a very clean way to get the mean squared error:
install.packages("Metrics")
library(Metrics)
mse(actual, predicted)
The first argument is the vector of actual values, taken from training.data.
The second argument is the vector of predictions, for example:
pd <- predict(training.model, training.data)
mse(training.data$y.linear, pd)  # assumes y.linear is the response
It seems you have not made predictions yet, so first predict from your model and then calculate the MSE.
You can use the residual component of the lm model output to find the MSE in this manner:
mse = mean(training.model$residuals^2)
Note: other programs (such as SAS) compute the mean from the residual sum of squares and the residual degrees of freedom. I recommend doing the same if you want an unbiased estimate of the error variance.
mse = sum(training.model$residuals^2)/training.model$df.residual
I found this while trying to figure out why mean(my_model$residuals^2) was different in R than the MSE in SAS.
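The two definitions can be compared side by side. This sketch refits the model with an explicit formula (y ~ x, which is an assumption about what the asker intended) on freshly simulated data, and checks that the degrees-of-freedom version equals the squared residual standard error reported by summary().

```r
set.seed(1)
x <- seq(0, 200, by = 1)
y <- x + rnorm(length(x), mean = 0, sd = 1)
d <- data.frame(x, y)

fit <- lm(y ~ x, data = d)             # response stated explicitly

# Divide by n: plain mean of the squared residuals
mse_n <- mean(residuals(fit)^2)

# Divide by the residual degrees of freedom (n - 2 here)
mse_df <- sum(residuals(fit)^2) / df.residual(fit)

# Same number as mse_n, computed from actual values and predictions
mse_pred <- mean((d$y - predict(fit))^2)

print(c(mse_n = mse_n, mse_df = mse_df, sigma2 = summary(fit)$sigma^2))
```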

Trouble with 'fitdistrplus' package, t-distribution

I am trying to fit t-distributions to my data but am unable to do so. My first try was
fitdistr(myData, "t")
There are 41 warnings, all saying that NaNs are produced. I don't know how, logarithms seem to be involved. So I adjusted my data somewhat so that all data is >0, but I still have the same problem (9 fewer warnings though...). Same problem with sstdFit(), produces NaNs.
So instead I try with fitdist which I've seen on stackoverflow and CrossValidated:
fitdist(myData, "t")
I then get
Error in mledist(data, distname, start, fix.arg, ...) :
'start' must be defined as a named list for this distribution
What does this mean? I tried looking into the documentation but that told me nothing. I just want to possibly fit a t-distribution, this is so frustrating :P
Thanks!
Start is the initial guess for the parameters of your distribution. There are logs involved because it is using maximum likelihood and hence log-likelihoods.
library(fitdistrplus)
dat <- rt(100, df=10)
fit <- fitdist(dat, "t", start=list(df=2))
I think it's worth adding that in most cases, using the fitdistrplus package to fit a t-distribution to real data will lead to a very bad fit, which is actually quite misleading. This is because the default t-distribution functions in R are used, and they don't support shifting or scaling. That is, if your data has a mean other than 0, or is scaled in some way, then the fitdist function will simply lead to a bad fit.
In real life, if data fits a t-distribution, it is usually shifted (i.e. has a mean other than 0) and / or scaled. Let's generate some data like that:
data = 1.5*rt(10000,df=5) + 0.5
Given this data has been sampled from the t-distribution with 5 degrees of freedom, you'd think that trying to fit a t-distribution to this should work quite nicely. But actually, here is the result. It estimates a df of 2, and provides a bad fit as shown in the qq plot.
> fit_bad <- fitdist(data,"t",start=list(df=3))
> fit_bad
Fitting of the distribution ' t ' by maximum likelihood
Parameters:
estimate Std. Error
df 2.050967 0.04301357
> qqcomp(list(fit_bad)) # generates plot to show fit
When you fit to a t-distribution you want to not only estimate the degrees of freedom, but also a mean and scaling parameter.
The metRology package provides a version of the t-distribution called t.scaled that has a mean and sd parameter in addition to the df parameter [metRology]. Now let's fit it again:
> library("metRology")
> fit_good <- fitdist(data,"t.scaled",
start=list(df=3,mean=mean(data),sd=sd(data)))
> fit_good
Fitting of the distribution ' t.scaled ' by maximum likelihood
Parameters:
estimate Std. Error
df 4.9732159 0.24849246
mean 0.4945922 0.01716461
sd 1.4860637 0.01828821
> qqcomp(list(fit_good)) # generates plot to show fit
Much better :-) The parameters are very close to how we generated the data in the first place! And the QQ plot shows a much nicer fit.
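If installing an extra package is not an option, the same location-scale t model can be fitted by maximising the likelihood directly with optim() from base R. This is a different technique from fitdist() and only a sketch: the scale and df are optimised on the log scale to keep them positive, and the starting values (median, MAD, df = 10) are arbitrary choices.

```r
set.seed(7)
x <- 1.5 * rt(10000, df = 5) + 0.5     # same kind of data as above

# Negative log-likelihood of a shifted and scaled t-distribution
# par = c(location, log(scale), log(df))
nll <- function(par, x) {
  m  <- par[1]
  s  <- exp(par[2])
  df <- exp(par[3])
  -sum(dt((x - m) / s, df = df, log = TRUE) - log(s))
}

start <- c(median(x), log(mad(x)), log(10))
fit <- optim(start, nll, x = x, method = "BFGS")

est <- c(mean = fit$par[1], scale = exp(fit$par[2]), df = exp(fit$par[3]))
print(est)
```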
