Calculate the distance between a regression line and a data point in R

I wonder if there is a way to calculate the distance between an abline in a plot and a data point. For example, what is the distance between the point with concentration == 40 and signal == 643 (element 5) and the abline?
concentration <- c(1,10,20,30,40,50)
signal <- c(4, 22, 44, 244, 643, 1102)
plot(concentration, signal)
res <- lm(signal ~ concentration)
abline(res)

You are basically asking for the residuals.
R> residuals(res)
1 2 3 4 5 6
192.61 12.57 -185.48 -205.52 -26.57 212.39
As an aside, when you fit a linear regression, the sum of the residuals is 0:
R> sum(residuals(res))
[1] 8.882e-15
and, if the model is correct, the residuals should follow a Normal distribution; you can check this with qqnorm(residuals(res)).
I find working with the standardised residuals easier.
> rstandard(res)
1 2 3 4 5 6
1.37707 0.07527 -1.02653 -1.13610 -0.15845 1.54918
These residuals have been scaled to have mean zero and (approximately) unit variance, and should follow a Normal distribution. Outlying standardised residuals are those larger than +/- 2.
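As a small follow-up sketch (my addition, not part of the original answer), you can check normality of the residuals and flag any standardised residual beyond +/- 2:
# check normality and look for outlying standardised residuals
qqnorm(residuals(res)); qqline(residuals(res))
which(abs(rstandard(res)) > 2)   # none are flagged for this fit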

You can use the function below:
http://paulbourke.net/geometry/pointlineplane/pointline.r
Then just extract the slope and intercept:
> coef(res)
(Intercept) concentration
-210.61098 22.00441
So your final answer would be:
concentration <- c(1,10,20,30,40,50)
signal <- c(4, 22, 44, 244, 643, 1102)
plot(concentration, signal)
res <- lm(signal ~ concentration)
abline(res)
cfs <- coef(res)
distancePointLine(y=signal[5], x=concentration[5], slope=cfs[2], intercept=cfs[1])
If you want a more general way to pick out a particular point, note that concentration == 40 returns a logical vector of length length(concentration). You can use that vector to select points.
pt.sel <- ( concentration == 40 )
> pt.sel
[1] FALSE FALSE FALSE FALSE TRUE FALSE
> distancePointLine(y=signal[pt.sel], x=concentration[pt.sel], slope=cfs["concentration"], intercept=cfs["(Intercept)"])
1.206032
Unfortunately distancePointLine doesn't appear to be vectorised (or rather, it issues a warning when you pass it a vector). Otherwise you could get answers for all points just by leaving the [] selector off the x and y arguments.
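If downloading the helper is inconvenient, here is a vectorised sketch of my own based on the standard point-to-line distance formula (the perpendicular distance to y = a + b*x is |y - (a + b*x)| / sqrt(1 + b^2)); it is not part of the linked pointline.r script:
# perpendicular distance from every point to the fitted line
a <- unname(cfs[1]); b <- unname(cfs[2])
abs(signal - (a + b * concentration)) / sqrt(1 + b^2)
# element 5 reproduces distancePointLine's 1.206032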

Related

Average Mean w/ Forecast Horizon > 1 in R

I use the updated greybox package in R to forecast the next 2 values (horizon = 2) with a moving-average scheme (see the first code block below), where the window size is 3.
For example, the goal is to take the average (1+2+3)/3 = 2 as the forecast for horizon 1 (h=1) and then re-use that predicted value for h=2, giving (2+3+2)/3 ≈ 2.3333.
The next forecast origin uses the window (2+3+4), where 4 is the actual value, to predict the next h1 and h2, which equal 3 and ≈3.3333 respectively.
Yet the prediction result I want, ValuesMA[[3]], only contains one row, i.e. the values for the first horizon. It should have as many rows as the predefined horizon, which is two.
I have code for an AR(1) process which works perfectly (second code block). At the end I add an MAE statistic to evaluate the model.
Can anyone help?
Thank you!
This is the underlying code I use:
#data
z <- c(1,2,3,4,5,6,7)
ourCall <- "mean(x=data,n.ahead=h)"
ourValue <- c("pred")
# Return a list for a forecasting horizon h
ValuesMA <- ro(z, h=2, origins=3, call=ourCall, ci=TRUE, co=TRUE)
ValuesMA[[3]]
**Which yields:**
origin3 origin4 origin5
[1,] 2 3 4
**But I want:**
origin3 origin4 origin5
[1,] 2 3 4
[2,] 2.3333 3.3333 4.3333
#data
z <- c(1,2,3,4,5,6,7)
# ci defines constant in-sample window size, co defines whether the holdout sample window size should be constant
ourCall <- "predict(arima(x=data,order=c(1,0,0)),n.ahead=h)"
# Ask for predicted values and the standard error
ourValue <- c("pred","se")
# Return a list for a forecasting horizon h with a rolling holdout size equal to origin
ValuesAR1 <- ro(z, h=2, origins=3, call=ourCall, value=ourValue, ci=TRUE, co=TRUE)
# calculate MAE
MAE_AR1 <- apply(abs(ValuesAR1$holdout - ValuesAR1$pred),1,mean,na.rm=TRUE) / mean(ValuesAR1$actuals)
ValuesAR1[[3]]
**Which yields:**
> ValuesAR1[[3]]
origin3 origin4 origin5
h1 2 3 4
h2 2 3 4
For further reading see: https://cran.r-project.org/web/packages/greybox/vignettes/ro.html
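For reference, here is a minimal sketch of the recursive moving-average forecast described in the question, using a hypothetical helper maForecast (not part of greybox). Returning a list with a pred element mirrors the shape that ourValue = "pred" expects, but whether this plugs straight into ro is an assumption; see the vignette linked above.
# hypothetical recursive moving-average forecaster: each step beyond the
# first re-uses earlier forecasts in the averaging window
maForecast <- function(data, h, window = 3) {
  x <- as.numeric(data)
  for (i in seq_len(h)) {
    x <- c(x, mean(tail(x, window)))   # append forecast, reuse it next step
  }
  list(pred = tail(x, h))
}
maForecast(c(1, 2, 3), h = 2)$pred
# [1] 2.000000 2.333333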

Use arima.sim to simulate ARIMA 1,1,1 with drift in R

I am trying to use arima.sim to simulate an ARIMA(1,1,1) process with drift. My problem is that I cannot seem to get it to work.
I need to get something like this:
My code is producing this though:
> mean(datatime)
[1] 15881.56
> sd(datatime)
[1] 8726.893
> length(datatime)
[1] 123
# The mean and variance from the original series
originalseriesmean = 15881.56
originalseriesvariance = 8726.893*8726.893
originalseriesn=123
# Simulation using arima.sim
ts.sim <- arima.sim(model=list(c(1,1,1)), n = 123, mean=190,sd=69.2863)
ts.plot(ts.sim)
How do I add a drift term to this function to make it look like the series shown above?
An ARIMA process doesn't have any drift/trend by definition. Inspired by this answer on Cross Validated about ARIMA with trend, and taking into account the values you want:
set.seed(123)
intercept <- 4500
b <- (32000 - intercept) / 123
x <- 1:123
y <- b * x + arima.sim(model=list(c(1, 0, 1)),
n = 123, mean=intercept, sd=2000)
> sd(y)
[1] 8020
> mean(y)
[1] 18370
The mean argument gives you the intercept of the process (where it starts). To recover the variance you should detrend your series, because the mean and sd values you quote include the trend, whereas to simulate such a process you should decompose it into noise + trend.
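For an ARIMA(1,1,1) with drift specifically, a minimal sketch is to simulate the differenced-and-integrated core with arima.sim and add a deterministic drift term. The AR and MA coefficients (0.6 and 0.4) are illustrative assumptions; the innovation sd (69.2863) and drift (190 per step) are taken from the question's own attempt.
set.seed(123)
n     <- 123
drift <- 190
# arima.sim returns n + d values when d = 1, so drop the leading one
core  <- arima.sim(model = list(order = c(1, 1, 1), ar = 0.6, ma = 0.4),
                   n = n, sd = 69.2863)
ts.sim <- ts(core[-1] + drift * seq_len(n))   # add linear drift
ts.plot(ts.sim)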

plotting a fitted segmented linear model shows more break points than what is estimated

I was helping a friend with segmented regressions today. We were trying to fit a piecewise regression with a breakpoint to see if it fits the data better than a standard linear model.
I stumbled across a problem I cannot understand. When fitting a piecewise regression with a single breakpoint to the data provided, it does indeed estimate a single breakpoint.
However, when you predict from the model it gives what looks like 2 breakpoints. When plotting the model using plot.segmented() this problem does not happen.
Anyone have any idea what is going on and how I can get the proper predictions (and standard errors etc)? Or what I am doing wrong in the code in general?
# load packages
library(segmented)
# make data
d <- data.frame(
  x = c(0, 3, 13, 18, 19, 19, 26, 26, 33, 40, 49, 51, 53, 67, 70, 88),
  y = c(0, 3.56211608128595, 10.5214485148819, 3.66063708049802, 6.11000808621074,
        5.51520423804034, 7.73043895812661, 7.90691392857039, 6.59626527933846,
        10.4413913666936, 8.71673928545967, 9.93374157928462, 1.214860139929,
        3.32428882257746, 2.65223361387063, 3.25440939462105))
# fit normal linear regression and segmented regression
lm1 <- lm(y ~ x, d)
seg_lm <- segmented(lm1, ~ x)
slope(seg_lm)
#> $x
#> Est. St.Err. t value CI(95%).l CI(95%).u
#> slope1 0.17185 0.094053 1.8271 -0.033079 0.37677000
#> slope2 -0.15753 0.071933 -2.1899 -0.314260 -0.00079718
# make predictions
preds <- data.frame(x = d$x, preds = predict(seg_lm))
# plot segmented fit
plot(seg_lm, res = TRUE)
# plot predictions
lines(preds$preds ~ preds$x, col = 'red')
Created on 2018-07-27 by the reprex package (v0.2.0).
It is a pure plotting issue. Printing the fitted segmented model shows a single estimated break point:
#Call: segmented.lm(obj = lm1, seg.Z = ~x)
#
#Meaningful coefficients of the linear terms:
#(Intercept) x U1.x
# 2.7489 0.1712 -0.3291
#
#Estimated Break-Point(s):
#psi1.x
# 37.46
The break point is estimated to be at x = 37.46, which is not any of the sampling locations:
d$x
# [1] 0 3 13 18 19 19 26 26 33 40 49 51 53 67 70 88
If you make your plot with fitted values at those sampling locations,
preds <- data.frame(x = d$x, preds = predict(seg_lm))
lines(preds$preds ~ preds$x, col = 'red')
You won't visually see the two fitted segments join up at the break point, because lines simply connects the fitted values at the sampled x locations one by one. plot.segmented, by contrast, knows where the break point is and draws the correct plot.
Try the following:
## the fitted model is piecewise linear between boundary points and break points
xp <- c(min(d$x), seg_lm$psi[, "Est."], max(d$x))
yp <- predict(seg_lm, newdata = data.frame(x = xp))
plot(d, col = 8, pch = 19) ## observations
lines(xp, yp) ## fitted model
points(d$x, seg_lm$fitted, pch = 19) ## fitted values
abline(v = d$x, col = 8, lty = 2) ## highlight sampling locations
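An alternative sketch (my own, not from the original answer) is to predict on a dense grid of x values, so that plain lines() traces both segments smoothly without looking up psi explicitly:
# predict on a fine grid so lines() passes through the break-point region
xg <- seq(min(d$x), max(d$x), length.out = 200)
yg <- predict(seg_lm, newdata = data.frame(x = xg))
plot(d, col = 8, pch = 19)
lines(xg, yg, col = "red")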
I cannot answer specifically because I am not familiar with the software that you used. Nevertheless, I tried with my own (home-made) software and got this:
Case of two connected segments :
This appears consistent with your result.
Case of two not connected segments :
Case of three connected segments :
One observes that the mean square error (MSE) is smallest in the case of two disconnected segments, which is not surprising given such large scatter.
The case of three connected segments is interesting. The result is intermediate between the other two. The added segment makes an almost vertical link between the two other segments.
Well, this doesn't explain the strange result from the software that you use. I wonder why that software doesn't find the smallest MSE with three segments.
The prediction that you got (two long segments linked by a very short one) gives exactly the same MSE as the fit without the short segment, as long as no data point is associated with the short segment. One can construct infinitely many equivalent solutions by adding "dummy" short segments, provided no data point is associated with them.
This is illustrated below, with a magnification of the "branching zone" to make it more legible.
The 2 segments solution is (AC)+(CB).
The first 3 segments solution is (AD)+(DE)+(EB).
Another 3 segments solution is (AF)+(FG)+(GB).
Another 3 segments solution is (AH)+(HI)+(IB).
One can imagine many other...
All those solutions have the same MSE, so they can be considered statistically equivalent when MSE is the criterion.

Repeat simulation of test scores 1000 times

I want to simulate the problem below in R and calculate the average probability based on 1000 simulations -
Scores on a test are normally distributed with mean 70 and std dev 10.
Estimate the probability that among 75 randomly selected students at least 22 score greater than 78
This is what I have done so far
set.seed(1)
scores = rnorm(1000,70,10)
head(scores)
hist(scores)
sm75=sample(scores,75)
length(sm75[sm75>78])/75
#[1] 0.1866667
However, this gives me only one iteration; I want 1000 iterations and then the average of those 1000 probabilities. I believe some kind of control structure using a for loop could be used. Also, is there an easier way via the "apply" family of functions?
At the end of the day you are testing whether at least 22 students score higher than 78, which can be compactly computed with:
sum(rnorm(75, 70, 10) > 78) >= 22
Breaking this down a bit, rnorm(75, 70, 10) returns the 75 scores, which are normally distributed with mean 70 and standard deviation 10. rnorm(75, 70, 10) > 78 is a vector of length 75 that indicates whether or not each of these scores is above 78. sum(rnorm(75, 70, 10) > 78) converts each TRUE to a 1 and each FALSE to a 0 and sums these values up, meaning it counts how many of the 75 scores exceed 78. Lastly, the full expression above tests whether that count is 22 or higher.
replicate can be used to replicate this any number of times. So to see the breakdown of 1000 simulations, you can use the following 1-liner (after setting your random seed, of course):
set.seed(144)
table(replicate(1000, sum(rnorm(75, 70, 10) > 78) >= 22))
# FALSE TRUE
# 936 64
In 64 of the replicates, at least 22 students scored above a 78, so we estimate the probability to be 6.4%.
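As a quick sanity check (my addition, not part of the original answer), the probability can also be computed exactly from the binomial distribution, since each student independently scores above 78 with probability 1 - pnorm(78, 70, 10):
p <- 1 - pnorm(78, mean = 70, sd = 10)   # ~0.212
1 - pbinom(21, size = 75, prob = p)      # P(at least 22 of 75), roughly 0.06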
Probability is calculated as the number of favourable outcomes divided by the total number of outcomes. So:
> scores <- sample(rnorm(1000,70,10),75)
> probability <- length(subset(scores,scores>78))/length(scores)
> probability
[1] 0.28
However, you want to do this a 1000 times, and then take an average.
> mean(replicate(1000, {scores<-sample(rnorm(1000,70,10),75);length(subset(scores,scores>78))/length(scores)}))
[1] 0.2133333

What is the "effects" returned by `aov` and `lm`?

I would like to ask about the difference between $coefficients and $effects in the aov output.
Here the f1 factor and the interaction f1 * f2 are significant. I want to interpret the effect of that factor on the response, and I thought that $effects was what I needed.
Let's consider the following simple data set.
f1 <- c(1,1,0,0,1,1,0,0)
f2 <- c(1,0,0,1,1,0,0,1)
r <- c(80, 50, 30, 10, 87,53,29,8)
av <- aov(r ~ f1 * f2)
summary(av)
av$coefficients
av$effects
plot(f1, r)
It seems that the response increases by 48.25 units because of f1:
mean(r[f1==1]) - mean(r[f1==0])
But I can't really see that in the $effects output. What does the $effects output really tell me?
The effects are the response values rotated according to the QR factorization of the design matrix. Check:
all.equal(qr.qty(av$qr, r), unname(av$effects))
# [1] TRUE
The effects are useful for finding the regression coefficients from the QR factorization:
all.equal(backsolve(av$qr$qr, av$effects), unname(coef(av)))
# [1] TRUE
They can also be used to find fitted values and residuals:
e1 <- e2 <- av$effects
e1[(av$rank+1):length(e1)] <- 0
e2[1:av$rank] <- 0
all.equal(unname(qr.qy(av$qr, e1)), unname(fitted(av)))
# [1] TRUE
all.equal(unname(qr.qy(av$qr, e2)), unname(residuals(av)))
# [1] TRUE
So in summary, the effects are a representation of the data in the rotated (orthogonal) domain, which is what least-squares regression is all about.
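As a further check (my own sketch, not part of the original answer): because Q is orthogonal, the squared effects beyond the model rank reproduce the residual sum of squares, and the squared effect of each single-df term equals its sequential (Type I) Sum Sq in summary(av):
all.equal(sum(av$effects[(av$rank + 1):length(av$effects)]^2),
          sum(residuals(av)^2))
# [1] TRUE
av$effects[2:4]^2   # compare with the Sum Sq column of summary(av)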

Resources