What is wrong with Root Mean Square Error?

I don't understand what is wrong with my rMSE implementation. I'm training my model using MSE as the loss function and also as the metric. After training, I use the evaluate function to evaluate my model on the test set, and then the predict function to get the predicted values. Then I compute the rMSE. My code is:
obs= model.compile(loss='mse', optimizer=keras.optimizers.Adam(lr=0.001),metrics=['mse'])
.......
test_eval = model.evaluate(X_test, Y_test, verbose=1)
print('Test loss (MSE):', test_eval[0])
predicted= model.predict(X_test, verbose=0)
rMSE = np.sqrt(pow(np.mean(predicted- Y_test), 2))
print(rMSE)
And I got these results:
Test loss (MSE): 12.0075311661
2.90274470011
But the square root of 12.0075311661 isn't 2.90274470011 (it is about 3.465). So, what is wrong?

Square the differences elementwise before taking the mean. You want the mean of the squared differences, not the square of the mean difference.
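A minimal NumPy sketch of the difference, using made-up numbers rather than the asker's data. The function names `rmse` and `sqrt_of_squared_mean` are illustrative and not part of Keras or NumPy. The `ravel()` calls are an added precaution, on the assumption that `predict` may return a column of shape (n, 1) while `Y_test` has shape (n,), in which case the subtraction would broadcast to an n-by-n matrix.

```python
import numpy as np

def rmse(predicted, observed):
    """Root of the mean of the squared errors."""
    predicted = np.asarray(predicted, dtype=float).ravel()
    observed = np.asarray(observed, dtype=float).ravel()
    return np.sqrt(np.mean((predicted - observed) ** 2))

def sqrt_of_squared_mean(predicted, observed):
    """What the question's code computes: |mean error|, not the RMSE."""
    predicted = np.asarray(predicted, dtype=float).ravel()
    observed = np.asarray(observed, dtype=float).ravel()
    return np.sqrt(np.mean(predicted - observed) ** 2)

# Illustrative values: the errors are +1, -1, +1, -1, so they cancel in the mean.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([2.0, 1.0, 4.0, 3.0])

print(rmse(y_pred, y_true))                  # every error has size 1
print(sqrt_of_squared_mean(y_pred, y_true))  # positive and negative errors cancel
```

Because positive and negative errors cancel inside the mean, the second quantity can be far smaller than the true RMSE, which is consistent with the mismatch reported in the question.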

Related

How to lower model sensitivity for the starting value of a weighting function for a MIDAS regression in R

I'm using midas_r from the midasr package, and I'm wondering whether it is possible to make the MIDAS model less sensitive to the starting values of the weighting function, in order to minimize my error metric.
I ran a simulation with different starting values and observed that the forecasting results are quite sensitive to the initial values. There is around a 30% difference between the minimum and maximum Root Mean Square Forecast Error (RMSFE) across the simulation.
I simulated the distribution of starting values as follows:
df <- setNames(data.frame(matrix(ncol=2, nrow=n_simulation)), c('Starting_value', 'RMSFE'))
for (i in 1:n_simulation) {
  randomvalue_1 <- runif(1, -5.0, 5.0)
  randomvalue_2 <- runif(1, -5.0, 5.0)
  randomvalue_3 <- runif(1, -5.0, 5.0)
  random_vecteur <- c(randomvalue_1, randomvalue_2)
  mod1 <- midas_r(target_data ~ mls(daily_data, 1:2, 25, nealmon) + mls(target_data, 1:1, 1),
                  start=list(daily_data=random_vecteur), Ofunction='optim', method='Nelder-Mead')
  ## Calculate average forecasts
  avgf <- average_forecast(list(mod1),
                           data=list(daily_data=daily_data, target_data=target_data),
                           insample=1:132, outsample=133:180,
                           type="rolling",
                           measures=c("MSE","MAPE","MASE"),
                           fweights=c("EW","BICW","MSFE","DMSFE"))
  df$Starting_value[i] <- paste('(', paste(toString(random_vecteur), ')'))
  df$RMSFE[i] <- sqrt(avgf$accuracy$individual$MSE.out.of.sample[1])
}
Is there something I can do to lower the model's sensitivity, or am I doing something wrong?
I tried the update function, update(Ofunction='nls'), as suggested in Mixed Frequency Data Sampling Regression Models: The R Package midasr (2016), but I still observe the sensitivity.
I'm willing to share my data if needed.
Thank you!

Interpreting R Bootstrapping Output: Confidence Intervals

I am trying to understand bootstrapping in R using the boot package. I am trying to do a simple Spearman rank correlation. I have some code based on a tutorial I found online, but I am having some issues interpreting the output. The code is below:
*Note: these data are just random numbers I used to help me learn how to run the boot function. They do not represent anything.
library(boot)
test_a = data.frame(v1 = c(1,5,8,3,2,9,5,10,3,5), v2 = c(3,4,7,2,1,10,3,8,8,2))
attach(test_a)
cor.test(v1, v2, method = "spearman")
# Statistic for boot(): Spearman correlation of the resampled rows i
function_2 = function(test_a, i) {
  d2 = test_a[i,]
  return(cor(d2$v1, d2$v2, method="spearman"))
}
set.seed(1)
test_boot = boot(test_a, function_2, R=1000)
test_boot
I get the following output:
boot(data = test_a, statistic = function_2, R = 1000)
Bootstrap Statistics :
     original       bias    std. error
t1* 0.6397639  -0.04253283   0.2547429
Which all makes sense to me. But I guess my confusion is with the boot.ci function
ci = boot.ci(test_boot, conf=0.95)
I get the following output:
> ci
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates
CALL :
boot.ci(boot.out = test_boot, conf = 0.95)
Intervals :
Level      Normal                Basic
95%   ( 0.1830,  1.1816 )   ( 0.3173,  1.2987 )
Level     Percentile             BCa
95%   (-0.0192,  0.9622 )   (-0.1440,  0.9497 )
Calculations and Intervals on Original Scale
And this is where I am a bit lost. I can't find a source that explains these intervals in layman's terms in the context of a correlation coefficient. A correlation obviously cannot exceed 1.0, yet two of the methods return a confidence interval that goes above 1. The sources that discuss these different confidence intervals have frankly been a bit confusing. Is any one of them better suited to certain parameters than the others? It is also possible that I am completely misinterpreting what I am doing.
I also include the results of plot(test_boot) for your reference.
The eventual goal (with actual data), once I am more confident in running and interpreting bootstrap results, would be to run tests for trends over time (the Mann-Kendall test for trends and the Theil-Sen slope estimator; I cannot use parametric statistics with my data :/) and compare my observed dataset with bootstrapped samples.
Any help would really be appreciated. Thank you in advance!

The top two intervals are not constrained to the range of the statistic. The normal interval uses the bootstrap to estimate the bias and standard error and then builds a symmetric interval around the bias-corrected estimate. The basic interval reflects the bootstrap percentiles around the original estimate. Either may or may not respect the bounds of the statistic. The bottom two intervals are percentile intervals (the first is a raw percentile interval and the second is a bias-corrected, accelerated interval). These identify particular values of the bootstrap statistics that define the CI. As such, they will always respect the theoretical bounds of the statistic being bootstrapped.
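As a rough check of where the upper limits above 1 come from, the normal and basic intervals can be reproduced by hand from the printed output. This sketch assumes the usual constructions (normal: bias-corrected estimate plus or minus z times the bootstrap standard error; basic: 2*t0 minus the percentile endpoints) and uses the rounded percentile limits from the printout, so it only matches to about three decimals. The variable names are illustrative.

```python
from statistics import NormalDist

# Values copied from the printed boot and boot.ci output in the question.
t0 = 0.6397639                     # original Spearman estimate
bias = -0.04253283                 # bootstrap estimate of bias
se = 0.2547429                     # bootstrap standard error
pct_lo, pct_hi = -0.0192, 0.9622   # reported percentile interval (rounded)

z = NormalDist().inv_cdf(0.975)    # about 1.96 for a 95% interval

# Normal interval: symmetric around the bias-corrected estimate.
normal = (t0 - bias - z * se, t0 - bias + z * se)

# Basic interval: percentile endpoints reflected around t0.
basic = (2 * t0 - pct_hi, 2 * t0 - pct_lo)

print("normal:", normal)
print("basic: ", basic)
```

Both results land close to the reported (0.1830, 1.1816) and (0.3173, 1.2987). Neither construction knows that a correlation is bounded by 1, which is why their upper limits can exceed it.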

Finding Mean Squared Error?

I have produced a linear data set and have used lm() to fit a model to it. I am now trying to find the MSE using mse() from the hydroGOF package.
I know the formula for MSE, but I'm trying to use this function. What would be the proper way to do so? I have looked at the documentation, but either I'm dumb or it's worded for people who actually know what they're doing.
library(hydroGOF)
x.linear <- seq(0, 200, by=1) # x data
error.linear <- rnorm(n=length(x.linear), mean=0, sd=1) # Error (0, 1)
y.linear <- x.linear + error.linear # y data
training.data <- data.frame(x.linear, y.linear)
training.model <- lm(training.data)
training.mse <- mse(training.model, training.data)
plot(training.data)
mse() needs two data frames. I'm not sure how to get a data frame out of lm(). Am I even on the right track to finding a proper MSE for my data?

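Try this. Note that lm(training.data) takes the first column (x.linear) as the response, so the predictions are compared against that column rather than against the whole data frame:
mean((training.data$x.linear - predict(training.model))^2)
The value varies from run to run because the data are generated without a fixed seed.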

You can also use the code below, which is a very clean way to get the mean squared error:
install.packages("Metrics")
library(Metrics)
mse(actual, predicted)
The first argument is the actual values, taken from training.data.
The second argument is the predicted values, for example:
pd <- predict(training.model, training.data)
mse(training.data$,pd)
It seems you have not made any predictions yet, so first predict from your model and then calculate the MSE.

You can use the residuals component of the lm model output to find the MSE in this manner:
mse = mean(training.model$residuals^2)

Note: other programs (such as SAS) compute the MSE as the residual sum of squares divided by the residual degrees of freedom, rather than by the number of observations. I recommend doing the same if you want an unbiased estimate of the error variance.
mse = sum(training.model$residuals^2)/training.model$df.residual
I found this while trying to figure out why mean(my_model$residuals^2) in R was different from the MSE in SAS.
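The two definitions above differ only in the divisor. A minimal NumPy sketch, assuming a straight-line fit with two estimated parameters (slope and intercept) and synthetic data generated just for illustration; `np.polyfit` stands in for R's lm() here and the variable names are hypothetical:

```python
import numpy as np

# Hypothetical data: a noisy line, generated here only for illustration.
rng = np.random.default_rng(0)
x = np.arange(0.0, 201.0)
y = x + rng.normal(0.0, 1.0, size=x.size)

# Fit y = slope * x + intercept by least squares.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

n = x.size
n_params = 2                       # slope and intercept
rss = float(np.sum(residuals ** 2))

mse_mean = rss / n                 # same as mean(residuals^2)
mse_df = rss / (n - n_params)      # divides by the residual degrees of freedom

print(mse_mean, mse_df)
```

The second value is always larger by the factor n / (n - 2), so the gap matters for small samples and becomes negligible for large ones.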

OHLC Time Series - Anomaly Detection with Multivariate Gaussian Distribution

I have an OHLC time series for some stock prices:
library(quantmod)
library(mnormt)
library(MASS)
download.file("http://dl.dropbox.com/u/25747565/941.RData", destfile="test.RData")
load("test.RData")
chartSeries(p)
As you can see from the plot, there are two downward spikes, most likely due to some sort of data error. I want to use a multivariate Gaussian to detect the rows which contain these two offending data points.
> x[122,]
941.Open 941.High  941.Low 941.Close
   85.60    86.65     5.36     86.20
> x[136,]
941.Open 941.High  941.Low 941.Close
   84.15    85.60    54.20     85.45
Here is my code to fit the distribution and calculate the probabilities of each data point:
x <- coredata(p[,1:4])
mu <- apply(x, 2, mean)
sigma <- cov.rob(x)$cov
prob <- apply(x, 1, dmnorm, mean = mu, varcov = sigma, log = TRUE)
However, this code throws up the following error:
Error in pd.solve(varcov, log.det = TRUE) : x appears to be not symmetric
This error did not come up when I used the standard cov() function to calculate the covariance matrix, only with the robust covariance function. The covariance matrix itself looks quite benign to me, so I'm not sure what is going on. The reason I want a robust estimate of the covariance matrix is that the standard covariance matrix was giving a few false positives, because I was including anomalies in my training set.
Can someone tell me:
a) how to fix this
b) if this approach even makes sense
Thank you!
P.S. I considered posting this on Cross Validated but thought SO was more appropriate as this seems like a "programming" issue.

You are on the right track. I don't know how to fix the error, but I can suggest how to find the indices of the anomalous points.
Your training data should contain only a small number of anomalies; otherwise the cross-validation and test results will suffer.
Calculate the mean and covariance in MATLAB as
mu=mean(traindata);
sigma=cov(traindata);
Set epsilon as per your requirement, then flag every cross-validation point whose density falls below it:
count=0;
Actual=[];
for j=1:size(cv,1)
    p=mvnpdf(cv(j,:),mu,sigma);
    if p<epsilon
        count=count+1;
        Actual=[Actual;j];
        fprintf('j=%d \t p=%e\n',j,p);
    end
end
Tune the threshold if the results are not satisfactory.
Evaluate the model using the F1 score (an F1 score of 1 means every anomaly was flagged with no false positives).
Apply the model to the test data.
Done!
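The thresholding steps above can be sketched in Python with NumPy. This is an illustration under stated assumptions, not the asker's data: the prices are synthetic, the helper names `gaussian_log_density` and `flag_anomalies` are hypothetical, and the threshold is applied to the log-density (as in the question's log = TRUE call) with an arbitrary cut-off of -50.

```python
import numpy as np

def gaussian_log_density(points, mu, sigma):
    """Log-density of each row of `points` under N(mu, sigma)."""
    points = np.atleast_2d(points)
    d = points.shape[1]
    diff = points - mu
    # Solve sigma * z = diff^T instead of inverting sigma explicitly.
    solved = np.linalg.solve(sigma, diff.T).T
    mahalanobis_sq = np.sum(diff * solved, axis=1)
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + mahalanobis_sq)

def flag_anomalies(train, candidates, log_epsilon):
    """Indices of candidate rows whose log-density is below log_epsilon."""
    mu = train.mean(axis=0)
    sigma = np.cov(train, rowvar=False)
    log_p = gaussian_log_density(candidates, mu, sigma)
    return np.flatnonzero(log_p < log_epsilon)

# Synthetic open/high/low/close rows around 85, for illustration only.
rng = np.random.default_rng(0)
train = 85.0 + rng.normal(size=(200, 4))
candidates = train[:5].copy()
candidates[2, 2] = 5.36            # inject a spike in the "low" column

anomalies = flag_anomalies(train, candidates, log_epsilon=-50.0)
print(anomalies)
```

Working with log-densities avoids underflow when a point is very far from the mean. The cut-off would still need tuning on labelled cross-validation data, as the answer suggests.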

Log Likelihood using R

I have a probability density function (PDF)
(1-cos(x-theta))/(2*pi)
theta is the unknown parameter. How do I write a log-likelihood function for this PDF? I am confused: the x will come from my data, but how do I handle the theta in the equation?
Thanks

You need to use an optimisation function in R to find the value of theta that maximises the log-likelihood (or, equivalently, minimises the negative log-likelihood). See help(nlm) or help(optimize) for starters.

The function you wrote is a likelihood function of theta given the known x:
ll(theta|x) = log((1-cos(x-theta))/(2*pi))
If you have many iid observations from this distribution, x1, x2, ..., xn, just take the sum of the above:
ll(theta|x1,x2,...) = Sum[log((1-cos(xi-theta))/(2*pi))]

If f(x_i) = (1-cos(x_i-theta))/(2*pi) for observation i, then the likelihood function is L(theta)=product(f(x_i)) and the log-likelihood is logL(theta)=sum(log(f(x_i))), assuming of course that the x_i are independent.
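A small Python sketch of the sum-of-logs form, with theta estimated by a plain grid search over [0, 2*pi). The grid search is a stand-in chosen for simplicity, not the R optimisers mentioned in the answers, and the function names and demo data are hypothetical.

```python
import numpy as np

def log_likelihood(theta, x):
    """Sum of log f(x_i) for f(x) = (1 - cos(x - theta)) / (2*pi)."""
    x = np.asarray(x, dtype=float)
    density = (1.0 - np.cos(x - theta)) / (2.0 * np.pi)
    # log(0) gives -inf where the density vanishes; silence that warning.
    with np.errstate(divide="ignore"):
        return float(np.sum(np.log(density)))

def grid_mle(x, n_grid=3600):
    """Theta on an even grid over [0, 2*pi) with the largest log-likelihood."""
    thetas = np.linspace(0.0, 2.0 * np.pi, n_grid, endpoint=False)
    values = np.array([log_likelihood(t, x) for t in thetas])
    return float(thetas[np.argmax(values)])

# Demo data placed symmetrically around 1 + pi. The density peaks at
# x = theta + pi, so the estimate should land near theta = 1.
x_demo = 1.0 + np.pi + np.array([-0.2, 0.0, 0.2])
theta_hat = grid_mle(x_demo)
print(theta_hat)
```

The data enter only through x, and theta stays a free argument. Evaluating the same function at different theta values and keeping the best one is all that "maximising the likelihood" means here.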

The log-likelihood is often associated with normal distributions, where the log conveniently cancels the exp function, but it is defined for any density. Here there is no exp function to cancel, so the log does not simplify the expression in the same way.
Btw., your PDF is periodic and theta just manipulates the phase of that function. Where does this PDF come from? What should it describe?
