Least squares fit of model in R

The data file (X in the code below) contains a record of monthly data X[t] over a twenty-year period.
The data can be modelled by X[12j+i] = Mu + s[i] + Y[12j+i] (i = 1,...,12; j = 0,...,k-1), where Mu, s[1],...,s[12] are parameters of the model, Y[t] is white noise WN(0, sigma^2) and k = 20. Given that the least-squares estimators of Mu and Mu+s[i] are the overall mean and the mean of all observations recorded in the ith period respectively, obtain a least-squares fit of this model to the data.
X <- c(20.73,20.51,21.04,21.34,21.60,21.67,21.93,22.18,21.55,21.38,20.78,20.75,20.57,20.09,20.61,21.33,21.72,21.83,21.70,22.62,21.40,21.53,20.71,20.82,20.73,20.65,20.67,21.40,21.21,21.63,21.77,22.20,21.29,21.44,21.01,20.75,20.64,20.24,21.03,21.61,21.46,21.61,22.08,22.66,21.21,20.95,20.88,20.37,20.53,20.30,21.26,21.14,21.99,21.88,22.46,22.31,21.65,21.60,20.62,20.71,20.64,20.94,20.89,21.19,21.57,21.91,21.71,21.95,21.52,21.06,20.77,20.50,20.67,20.77,21.06,21.70,20.73,21.83,21.71,22.99,21.81,20.97,20.72,20.43,20.49,20.33,20.95,21.34,21.61,21.88,22.28,22.33,21.16,21.00,21.07,20.59,20.87,20.59,21.06,21.23,21.59,21.80,21.76,22.48,21.91,20.96,20.83,20.86,20.36,20.48,20.89,21.35,21.80,21.87,22.13,22.54,21.91,21.33,21.18,20.67,20.98,20.77,21.22,21.09,21.37,21.71,22.45,22.25,21.70,21.67,20.59,21.12,20.35,20.86,20.87,21.29,21.96,21.85,21.90,22.10,21.64,21.56,20.46,20.43,20.87,20.38,21.05,20.78,21.99,21.59,22.29,22.23,21.70,21.12,20.69,20.47,20.42,20.51,21.10,21.39,21.98,21.86,22.40,22.04,21.69,21.32,20.74,20.51,20.21,20.29,20.64,21.29,22.03,21.90,22.22,22.07,21.95,21.57,21.01,20.27,20.97,20.64,20.95,21.19,22.02,21.73,22.35,22.45,21.50,21.15,21.04,20.28,20.27,20.48,20.83,21.78,22.11,22.31,21.80,22.52,21.41,21.13,20.61,20.81,20.82,20.42,21.20,21.19,21.39,22.33,21.91,22.36,21.53,21.53,21.12,20.37,21.01,20.23,20.71,21.17,21.63,21.89,22.34,22.23,21.45,21.32,21.05,20.90,20.80,20.69,20.49,21.28,21.68,21.98,21.67,22.63,21.77,21.36,20.61,20.83)
I found the least-squares estimators of Mu and (Mu+s[i]):
lse.Mu <- mean(X)                    # overall mean
IndicatorVar <- rep(1:12, 20)        # month indicator 1..12 for each of the 240 observations
lse.Mu.si <- numeric(12)
for (i in 1:12) { lse.Mu.si[i] <- mean(X[IndicatorVar == i]) }   # mean of all month-i observations
This is where I get confused. I'm not sure what to do next to find the least-squares fit of the model. I tried finding the estimate of Y:
est.Y <- numeric(240)
for (i in 1:12) { for (j in 0:19) { est.Y[12*j + i] <- X[12*j + i] - lse.Mu.si[i] } }
But I still don't know how to use this to find the least-squares fit, or where the white noise Y[t] comes into it.
Can you please point me in the right direction or let me know what code to use? I've spent three days on Google and I still can't work it out!
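For what it's worth, here is a minimal sketch of one way the pieces fit together, using the objects defined above (a reading of the model, not an authoritative answer): the fitted value for an observation in month i is just lse.Mu.si[i], and est.Y then estimates the white-noise term, whose sample variance estimates sigma^2.
fitted.X <- lse.Mu.si[IndicatorVar]            # fitted value: the month-i mean for every month-i observation
est.Y <- X - fitted.X                          # same residuals as the double loop above
est.sigma2 <- sum(est.Y^2) / (length(X) - 12)  # one possible estimate of sigma^2 (12 fitted means)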
Following on from this, I need to examine the validity of the model by making a graphical comparison of the data with the model, and by employing any statistical tests that are considered appropriate. Any suggestions on which graphs and statistical tests would be best to use would be greatly appreciated.
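If it helps, a hedged sketch of diagnostics commonly used for a seasonal-means model like this (suggestions only, not the definitive choice): compare the data with the fitted seasonal means, and check whether est.Y behaves like white noise.
plot.ts(X)                                     # the data
lines(rep(lse.Mu.si, 20), col = "red")         # fitted seasonal means repeated over the 20 years
acf(est.Y)                                     # white noise should show no significant autocorrelation
Box.test(est.Y, lag = 20, type = "Ljung-Box")  # portmanteau test for remaining autocorrelation
qqnorm(est.Y); qqline(est.Y)                   # rough check of normality of the residuals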

Related

Generate random data from arbitrary CDF in R?

I have an arbitrary CDF that is applied to a point estimate. I have a number of these point estimates, each with an associated CDF, and I need to simulate random data from them for a Monte Carlo simulation.
I generate the CDF by fitting a spline through arbitrary points provided in a table. For example, the 0.1 quantile is 0.13 * point estimate and the 0.9 quantile is 7.57 * point estimate. It is fairly crude and is based on a large study comparing these models to real-world systems -- please ignore that for now.
I fit the CDF using a spline fit as shown here.
If I take the derivative of this, I get the shape of the pdf (image).
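To make that construction concrete, here is a rough sketch of how such a spline CDF could be built; the probability grid, the 1.00 multiplier for the median, and the point-estimate value are all made up for illustration (only the 0.13 and 7.57 multipliers come from the question):
pe     <- 1.0                                        # a hypothetical point estimate
probs  <- c(0.1, 0.5, 0.9)                           # tabulated probabilities (assumed)
quants <- c(0.13, 1.00, 7.57) * pe                   # corresponding quantiles (1.00 is assumed)
cdf    <- splinefun(quants, probs, method = "hyman") # monotone spline through (quantile, probability)
cdf(0.13 * pe)                                       # should return roughly 0.1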
I modified the function "samplepdf" found here, Sampling from an Arbitrary Density, as follows:
samplecdf <- function(n, cdf, spdf.lower = -Inf, spdf.upper = Inf) {
  my_fun <- match.fun(cdf)
  # numerically invert the CDF: find t such that cdf(t) = u
  invcdf <- function(u) {
    subcdf <- function(t) my_fun(t) - u
    if (spdf.lower == -Inf)
      spdf.lower <- endsign(subcdf, -1)   # endsign() comes from the linked "Sampling from an Arbitrary Density" post
    if (spdf.upper == Inf)
      spdf.upper <- endsign(subcdf)
    return(uniroot(subcdf, c(spdf.lower, spdf.upper))$root)
  }
  sapply(runif(n), invcdf)                # inverse-transform sampling on uniform draws
}
This seems to work OK: when I compare the quantiles estimated from the randomly generated data, they are fairly close to the initial values. However, when I look at the histogram, something funny is happening at the tail, where the function seems to consistently generate more values than it should according to the pdf. It does this across all my point estimates, and even though the individual quantiles look close, I can tell that the overall Monte Carlo simulation is producing higher estimates for the 50th percentile than I expect. Here is a plot of my histogram of the random samples.
Any tips or advice would be very welcome. I think the best route would be to fit an exponential distribution to the CDF, but I'm struggling to do that. All "fitting" assumes that you have data that needs to be fitted -- this is more arbitrary than that.

R absolute value of residuals with log transformation

I have a linear model in R of the form
lm(log(num_encounters) ~ log(distance)*sampling_effort, data=df)
I want to interpret the residuals, but get them back on the scale of num_encounters. I have seen residuals.lm(x, type="working") and residuals.lm(x, type="response"), but I'm not sure about the values they return. Do I, for instance, still need to use exp() to get the residual values back on the num_encounters scale, or are they already on that scale? I want to plot these absolute residuals, both in a histogram and in a raster map afterwards.
EDIT:
Basically my confusion is that the following code results in 3 different histograms, while I was expecting the first 2 to be identical.
df$predicted <- exp(predict(x, newdata=df))
histogram(df$num_encounters-df$predicted)
histogram(exp(residuals(x, type="response")))
histogram(residuals(x, type="response"))
I want to interpret the residuals but get them back on the scale of num_encounters.
You can easily calculate them:
mod <- lm(log(num_encounters) ~ log(distance)*sampling_effort, data=df)
res <- df$num_encounters - exp(predict(mod))
In addition to what @Roland suggests, which indeed is correct and works, my confusion was really just about basic high-school logarithm algebra.
Indeed the absolute response residuals (on the scale of the original dependent variable) can be calculated, as @Roland says, with
mod <- lm(log(num_encounters) ~ log(distance)*sampling_effort, data=df)
res <- df$num_encounters - exp(predict(mod))
If you want to calculate them from the model residuals, you need to take the logarithm subtraction rule into account:
log(a)-log(b)=log(a/b)
The residual is calculated on the scale of the original model. In my case the model predicts log(num_encounters), so the residual is log(observed) - log(predicted).
What I was trying to do was
exp(resid) = exp(log(obs)-log(pred)) = exp(log(obs/pred)) = obs/pred
which is clearly not the number I was looking for. To get the absolute response residual from the model response residual, this is what I needed.
obs-obs/exp(resid)
So in R code, this is what you could also do:
mod <- lm(log(num_encounters) ~ log(distance)*sampling_effort, data=df)
abs_resid <- df$num_encounters - df$num_encounters/exp(residuals(mod, type="response"))
This resulted in the same number as with the method described by #Roland which is much easier of course. But at least I got my brain lined up again.
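For what it's worth, a quick numeric sanity check of that algebra with made-up numbers (obs and pred are arbitrary values, just to confirm that obs - obs/exp(resid) equals obs - pred):
obs  <- 10
pred <- 4
resid <- log(obs) - log(pred)   # residual on the log scale
obs - obs/exp(resid)            # 6, identical to obs - pred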

Exponential regression with R (and negative values)

I am trying to fit a curve to a set of data points but have not succeeded, so I'm asking here.
plot(time, val)                                            # look at data
exponential.model <- lm(log(val) ~ time)                   # compute model
fit <- exp(predict(exponential.model, list(time = time)))  # create the fitted curve
plot(time, val)                                            # plot it again
lines(time, fit, lwd = 2)                                  # show the fitted line
My only problem is that my data contains negative values, so log(val) produces a lot of NAs and the model computation fails.
I know that my data does not necessarily look exponential, but I want to see the fit anyway. Another program shows me that val = 27.1331*exp(-time/2.88031) is a nice fit, but I do not know what I am doing wrong.
I want to compute it with R.
I had the idea of shifting the data so that no negative values remain, but the result is poor and almost certainly wrong.
plot(time, val + 20)                                       # look at shifted data
exponential.model <- lm(log(val + 20) ~ time)              # compute model
fit <- exp(predict(exponential.model, list(time = time)))  # create the fitted curve
plot(time, val)                                            # plot it again
lines(time, fit - 20, lwd = 2)                             # show the (BAD) fitted line
Thank you!
I figured some things out and have a satisfying solution.
exponential.model <- lm(log(val) ~ time) # compute model
The log(val) term is trying to rescale the values so that a linear model can be applied. Since this is not possible with my values (they contain negatives), you have to use a non-linear model with nls().
exponential.model <- nls(val ~ a*exp(b*time), start = list(a = 30, b = -0.1))
This worked fine for me.
[plot: satisfying fit]
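For completeness, a minimal sketch of drawing that fitted curve over the data (same nls call as above; the starting values are rough guesses, and time/val are the question's vectors, assumed sorted by time):
exponential.model <- nls(val ~ a*exp(b*time), start = list(a = 30, b = -0.1))
plot(time, val)                                   # the raw data, negatives included
lines(time, predict(exponential.model), lwd = 2)  # fitted exponential curve, no log transform needed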

Standard error in glm output

I am using R's glm to model Poisson data binned by year, so I have counts x[i] with exposure T[i] in each year i. The glm with Poisson family and log link produces model coefficients a, b for the linear predictor y = a + bx.
What I need is the standard error of (a + bx), not the standard error of a or the standard error of b. The documentation describing the solution I am trying to implement says this should be calculated by the software, because it is not straightforward to calculate from the parameters a and b alone. Perhaps SAS does the calculation, but I am not recognizing it in R.
I am working through section 7.2.4.5 of the Handbook of Parameter Estimation (NUREG/CR-6823, a public document), looking at eq. 7.2. I am not a statistician, so I am finding this very hard to follow.
The game here is to find the 90 percent simultaneous confidence interval on the model output, not the confidence interval at each year, i.
Adding this here so I can show some code. The first answer below appears to get me pretty close. A statistician here put together the following function to construct the confidence bounds. This appears to work.
# trend line simultaneous confidence intervals
# according to HOPE 7.2.4.5
# note: relies on `data`, `model` and `n` existing in the calling environment
HOPE <- function(x, ...) {
  t   <- data$T                                                                 # exposure in each year
  mle <- predict(model, newdata = data.frame(x = data$x), type = "response")    # fitted counts
  se  <- as.data.frame(predict(model, newdata = data.frame(x = data$x),
                               type = "link", se.fit = TRUE))[, 2]              # SE of the linear predictor
  chi <- qchisq(.90, df = n - 1)
  upper <- (mle + (chi * se)) / t
  lower <- (mle - (chi * se)) / t
  return(as.data.frame(cbind(mle, t, upper, lower)))
}
I think you need to provide the argument se.fit=TRUE when you create the prediction from the model:
hotmod<-glm(...)
predz<-predict(hotmod, ..., se.fit=TRUE)
Then you should be able to find the estimated standard errors using:
predz$se.fit
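A small self-contained sketch of that pattern; the data frame, offset and coefficient values below are made up purely to show the calls, not taken from the question:
set.seed(1)
dat <- data.frame(year = 1:20, exposure = 100)                # hypothetical exposures T[i]
dat$counts <- rpois(20, lambda = 0.05 * dat$exposure * exp(0.02 * dat$year))
hotmod <- glm(counts ~ year + offset(log(exposure)), family = poisson, data = dat)
predz <- predict(hotmod, type = "link", se.fit = TRUE)
predz$se.fit                                                  # standard errors of a + b*year for each year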
Now if you want to do it by hand on this software, it should not be as hard as you suggest:
covmat<-vcov(hotmod)
coeffs<-coef(hotmod)
Then the standard error of the linear predictor a + b*x at a particular x is obtained from the design vector (1, x), not from the coefficients themselves:
x0 <- c(1, x)
sqrt(t(x0) %*% covmat %*% x0)
The operator %*% can be used for matrix multiplication in this software.

OHLC Time Series - Anomaly Detection with Multivariate Gaussian Distribution

I have an OHLC time series for some stock prices:
library(quantmod)
library(mnormt)
library(MASS)
download.file("http://dl.dropbox.com/u/25747565/941.RData", destfile="test.RData")
load("test.RData")
chartSeries(p)
As you can see from the plot, there are two downward spikes, most likely due to some sort of data error. I want to use a multivariate Gaussian to detect the rows which contain these two offending data points.
> x[122,]
941.Open 941.High 941.Low 941.Close
85.60 86.65 5.36 86.20
> x[136,]
941.Open 941.High 941.Low 941.Close
84.15 85.60 54.20 85.45
Here is my code to fit the distribution and calculate the probabilities of each data point:
x <- coredata(p[,1:4])
mu <- apply(x, 2, mean)
sigma <- cov.rob(x)$cov
prob <- apply(x, 1, dmnorm, mean = mu, varcov = sigma, log = TRUE)
However, this code throws up the following error:
Error in pd.solve(varcov, log.det = TRUE) : x appears to be not symmetric
This error did not come up when I used the standard cov() function to calculate the covariance matrix, only with the robust covariance matrix function. The covariance matrix itself looks quite benign to me, so I'm not sure what is going on. The reason I want to use a robust estimate of the covariance matrix is that the standard covariance matrix was giving a few false positives, since anomalies were included in my training set.
Can someone tell me:
a) how to fix this
b) if this approach even makes sense
Thank you!
P.S. I considered posting this on Cross Validated but thought SO was more appropriate as this seems like a "programming" issue.
You are on the right track. I don't know how to fix that error, but I can suggest how to find the indices of the novel (anomalous) points.
Your training data should contain only a small number of anomalies, otherwise the cross-validation and test results will suffer.
Calculate the mean and covariance of the training data (the sketch below is in MATLAB):
mu = mean(traindata);
sigma = cov(traindata);
Set epsilon as per your requirement, then flag the cross-validation points whose density falls below it:
count = 0;
Actual = [];
for j = 1:size(cv,1)
    p = mvnpdf(cv(j,:), mu, sigma);
    if p < epsilon
        count = count + 1;
        Actual = [Actual; j];
        fprintf('j=%d \t p=%e\n', j, p);
    end
end
Tune the threshold if the results are not satisfactory.
Evaluate the model using the F1 score (if the F1 score is 1, you did it right).
Apply the model to the test data.
Done!
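Since the question is about R, here is a rough translation of that loop into R; traindata, cv and epsilon are assumed to exist just as in the MATLAB sketch, and the density call uses the mnormt package already loaded in the question:
library(mnormt)
mu    <- colMeans(traindata)                              # mean of each column
sigma <- cov(traindata)                                   # (non-robust) covariance matrix
p     <- apply(cv, 1, dmnorm, mean = mu, varcov = sigma)  # density of each cross-validation row
anomalous <- which(p < epsilon)                           # indices of points flagged as anomalies
anomalous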
