3D plot of the residual sum of squares in linear regression

I'm trying to reproduce Figure 3.2 from the book Introduction to Statistical Learning. The figure shows a 3D plot of the residual sum of squares (RSS) for the Advertising data, using Sales as the response and TV as the predictor, over a grid of values for $\beta_0$ and $\beta_1$.
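For reference, the quantity being plotted is $\mathrm{RSS}(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$, which is what the MyRss function below computes in matrix form.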
My code is pasted below:
# Read data from URL
data <- read.csv("http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv")
# Extract TV variable
X <- data[, 2]
# Design matrix with a column of ones for the intercept
X <- cbind(1, X)
# Response variable
y <- data$Sales
# Define function for RSS
MyRss <- function(beta0, beta1) {
  b <- c(beta0, beta1)
  rss <- t(y - X %*% b) %*% (y - X %*% b)
  return(rss)
}
Now I calculate the RSS for each combination of $\beta_0$ and $\beta_1$ and plot it with the persp() function:
b0 <- seq(5, 9, 0.01)
b1 <- seq(0.03, 0.06, 0.001)
z <- outer(b0, b1, function(x,y) mapply(MyRss, x, y))
persp(x = b0, y = b1, z = z)
The resulting plot is pasted below:
[Image: persp() output of the RSS surface]
I don't know where my code fails. Thanks in advance for any pointers.

Related

Replace lm coefficients and calculate results of lm new in R

I am able to change the coefficients of my linear model. I then want to compare the results of my "new" model with the new coefficients, but R is not recalculating the results with them.
As you can see in the following example, the summaries of my models fit and fit1 are exactly the same, even though results such as the multiple R-squared and the fitted values should change.
set.seed(2157010) #forgot set.
x1 <- 1998:2011
x2 <- x1 + rnorm(length(x1))
y <- 3*x2 + rnorm(length(x1)) #you had x, not x1 or x2
fit <- lm( y ~ x1 + x2)
# view original coefficients
coef(fit)
# generate second function for comparing results
fit1 <- fit
# replace coefficients with new values, use whole name which is coefficients:
fit1$coefficients[2:3] <- c(5, 1)
# view new coefficients
coef(fit1)
# Comparing
summary(fit)
summary(fit1)
Thanks in advance
It might be easier to compute the multiple R^2 yourself with the substituted parameters.
mult_r2 <- function(beta, y, X) {
  tot_ss <- var(y) * (length(y) - 1)
  rss <- sum((y - X %*% beta)^2)
  1 - rss/tot_ss
}
(Or, more compactly, following the comments, you could compute p <- X %*% beta; cor(y, p)^2.)
mult_r2(coef(fit), y = model.response(model.frame(fit)), X = model.matrix(fit))
## 0.9931179, matches summary()
Now with new coefficients:
new_coef <- coef(fit)
new_coef[2:3] <- c(5,1)
mult_r2(new_coef, y = model.response(model.frame(fit)), X = model.matrix(fit))
## [1] -343917
That last result seems pretty wild, but the substituted coefficients are very different from the true least-squares coefficients, and a negative R^2 is possible when the model is bad enough ...
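Similarly, if you want the fitted values or residuals under the substituted coefficients, compute them by hand from the model matrix rather than relying on the stale components stored inside fit1. A sketch using the objects defined above:
y_obs <- model.response(model.frame(fit))
X <- model.matrix(fit)
fitted_new <- drop(X %*% new_coef) # fitted values under the substituted coefficients
resid_new <- y_obs - fitted_new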

observed information matrix for logistic model

I have proposed my own model and am now trying to implement it using R, but I have got stuck on how to find the observed information matrix by applying my formula. I used glm() to fit a logistic model with a penalty term to a binary data set x1, x2, x3, y (all binary 0/1); fit1 is the glm() model and dev.new is the penalised deviance.
library(Matrix) # needed for Diagonal()
X.tilde <- as.matrix(x) # n*p matrix of the data set
W <- Diagonal(length(y), weights) # n*n diagonal matrix of the weights
qq <- exp(fit1$fitted.values)/(1 + exp(fit1$fitted.values)) # n*1 vector (pi=probability of the logistic model )
cc <- t(1 - qq) # n*1 vector
gg <- (dev.new) * t(dev.new) # p*p matrix
ff <- (X.tilde) %*% t(X.tilde) # n*n matrix
pp <- exp(fit1$coefficients)/(1 + exp(fit1$coefficients)) # p*1 matrix
ss <- t(1/(1 + exp(fit1$coefficients))) # p*1 vector
aa <- t(X.tilde) %*% qq %*% cc %*% W %*% (X.tilde) # p*p matrix
firstP <- (aa + (pp * ss)) # p*p matrix
info.mat <- firstP+gg # p*p matrix
info.mat <- as.matrix(info.mat)
This code returns the following error:
Error in e1 + Matrix(e2) :
Matrices must have same number of rows for arithmetic
In my theory the dimensions work out, but when I implement it, it is not correct.
Any help?
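One way to locate the mismatch is to print the dimensions of every term before combining them; all the summands of info.mat must be p*p. A debugging sketch, assuming the objects above exist:
pieces <- list(aa = aa, gg = gg, pp = pp, ss = ss)
sapply(pieces, function(m) paste(dim(as.matrix(m)), collapse = " x "))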

Is it possible to fit specific slopes to best fit segments of data in R?

Background: I am analyzing oil production data where I plot the daily oil rate on the y-axis and a diagnostic "time" factor on the x-axis. This combination tends to exhibit a certain trend depending on the flow regime: typically a half slope or quarter slope followed by a unit slope. It is very basic, but the approach is archaic and everything is done manually.
I was wondering if there is a way in R to find the segment of the data that best fits a specific slope and to fit the associated line over that segment, perhaps up to an R^2 criterion, on a log-log plot. Also, is there a way to get the point where the slope changes?
[Image: example of what the raw data looks like]
[Image: example of desired end result]
What about using a scatterplot?
scatter.smooth(x=data$x, y=data$y, main="y ~ x") # scatterplot
In the future, please provide your data in reproducible form so we can work with it. This time I have provided some sample data in the Note at the end.
Let kvalues be the candidate indexes in x of the change point. We do not include indexes near the ends, to avoid numeric problems. Then for each kvalue we perform the regression defined in the regr function and compute the residual sum of squares using deviance. We take the smallest of those and display that regression. No packages are used.
(If you want to fix the slopes, remove the slope parameters from the formula and the starting values and replace them with the fixed values in the formula; see the sketch after the Note.)
kvalues <- 5:45
st <- list(a1 = 1, b1 = 1, a2 = 2, b2 = 2)
regr <- function(k) try(nls(y ~ ifelse(x < k, a1 + b1 * x, a2 + b2 * x), start = st))
i <- which.min(sapply(kvalues, function(k) deviance(regr(k))))
k <- kvalues[i]
k; x[k]
## [1] 26
## [1] 26
fm <- regr(k)
fm
## Nonlinear regression model
## model: y ~ ifelse(x < k, a1 + b1 * x, a2 + b2 * x)
## data: parent.frame()
## a1 b1 a2 b2
## 1.507 -1.042 1.173 -2.002
## residual sum-of-squares: 39.52
##
## Number of iterations to convergence: 1
## Achieved convergence tolerance: 2.917e-09
plot(y ~ x)
lines(fitted(fm) ~ x)
abline(v = x[k])
Note
set.seed(123)
x <- 1:50
y <- 1 - rep(1:2, each = 25) * x + rnorm(50)
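For example, here is a sketch of the fixed-slope variant mentioned above, with the two slopes pinned at the values -1 and -2 used to generate the sample data (substitute your own half/quarter/unit slopes):
st2 <- list(a1 = 1, a2 = 2)
regr2 <- function(k) try(nls(y ~ ifelse(x < k, a1 - 1 * x, a2 - 2 * x), start = st2))
i2 <- which.min(sapply(kvalues, function(k) deviance(regr2(k))))
x[kvalues[i2]]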

R: Confidence intervals on non-linear fit with a non-analytic model

I need to fit x-y data with a model, which is non-analytic. I have a function f(x) that calculates the model for each x numerically, but there is no analytical equation. For the fit, I use optim in R. I minimise RMS between the model and the data. It works well and returns reasonable parameters.
I would like to find confidence intervals (or at least standard errors) for the best-fitting parameters. I found on the internet that this can be done from the Hessian matrix, but only when maximising a log-likelihood function. I don't know how to do this; all I have is x, y and f(x), from which I compute the RMS. Alas, I have no good way of estimating the errors on y.
How can I find confidence intervals on my fit parameters?
Edit: perhaps an example in R might help explaining what I'm asking for. This example uses a simple analytic function to fit the data, in my real case the function is non-analytic, so I cannot use, e.g., nls.
set.seed(666)
# generate data
x <- seq(100) / 100
y <- 0.5 * x + rnorm(100, sd = 0.03) + 0.2
# function to fit
f <- function(x, a, b) {
  a * x + b
}
# error function to minimise: RMS
errfun <- function(par, x, y) {
  a <- par[1]
  b <- par[2]
  sqrt(sum((f(x, a, b) - y)^2))
}
# use optim to fit the model to the data
par <- c(1, 0)
res <- optim(par, errfun, gr=NULL, x, y)
# best-fitting parameters
best_a <- res$par[1]
best_b <- res$par[2]
The best fitting parameters are a = 0.50 and b = 0.20. I need to find 95% confidence intervals on these.
This is a job for the bootstrap:
(1) Create a large number of synthetic datasets x*. These are created by sampling from x with replacement, drawing the same number of points as were in x. For example, if your data is (1,2,3,4,5,6), an x* might be (5,2,4,4,2,3); note that values can appear multiple times, or not at all, because we are sampling with replacement.
(2) For each x*, calculate f(x*). If there are other parameters which don't depend on the data, don't change them (so f(x,a,b,c) becomes f(x*,a,b,c) as long as a, b, c don't depend on x). Call these quantities f*.
(3) You can estimate anything you want from these f*. If you want the standard deviation of f(x), take the standard deviation of f*. If you want the 95% confidence interval, take the range from the 2.5th to the 97.5th percentiles of f*. More formally, if you want to estimate g(f(x)), you estimate it as g(f(x*)).
I should say this is a very practically-oriented explanation of the bootstrap. I have glossed over many theoretical details, but the bootstrap is near-universally applicable (basically as long as the thing you are trying to estimate actually exists, you are usually okay).
To apply this to the example you have given in your code:
x <- seq(100) / 100
y <- 0.5 * x + rnorm(100, sd = 0.03) + 0.2
# function to fit
f <- function(x, a, b) {
  a * x + b
}
# error function to minimise: RMS
errfun <- function(par, x, y) {
  a <- par[1]
  b <- par[2]
  sqrt(sum((f(x, a, b) - y)^2))
}
# this is the part where we bootstrap
# use optim to fit the model to the data
best_a <- best_b <- numeric(10000)
for (i in 1:10000) {
  j <- sample(100, replace = TRUE)
  x.boot <- x[j]
  y.boot <- y[j]
  par <- c(1, 0)
  res <- optim(par, errfun, gr = NULL, x.boot, y.boot)
  # best-fitting parameters
  best_a[i] <- res$par[1]
  best_b[i] <- res$par[2]
}
# now, we look at the *vector* best_a
# for example, if you want the standard deviation of a,
sd(best_a)
# or a 95% confidence interval for b,
quantile(best_b,c(0.025,0.975))
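As an alternative to the bootstrap, the Hessian route mentioned in the question also works here, because with independent Gaussian errors minimising the sum of squares is equivalent to maximising the log-likelihood. A sketch under that assumption (note it optimises the sum of squares itself, not its square root, so the Hessian has the right scale):
ssr_fun <- function(par, x, y) sum((f(x, par[1], par[2]) - y)^2)
res_h <- optim(c(1, 0), ssr_fun, x = x, y = y, hessian = TRUE)
sigma2 <- res_h$value / (length(y) - length(res_h$par)) # residual variance estimate
vcov_par <- 2 * sigma2 * solve(res_h$hessian) # approximate covariance of the parameters
se <- sqrt(diag(vcov_par))
res_h$par + outer(se, c(-1, 1) * qnorm(0.975)) # approximate 95% confidence intervals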

How to simulate quantities of interest using arm or rstanarm packages in R?

I would like to know how to simulate quantities of interest from a regression model estimated using either the arm or the rstanarm package in R. I am a newbie in Bayesian methods and R and have been using the Zelig package for some time. I asked a similar question before, but I would like to know if it is possible to simulate those quantities using the posterior distribution estimated by those packages.
In Zelig you can set the values you want for the independent values and it calculates the results for the outcome variable (expected value, probability, etc). An example:
# Creating a dataset:
set.seed(10)
x <- rnorm(100,20,10)
z <- rnorm(100,10,5)
e <- rnorm(100,0,1)
y <- 2*x+3*z+e
df <- data.frame(x,z,e,y)
# Loading Zelig
require(Zelig)
# Model
m1.zelig <- zelig(y ~ x + z, model="ls", data=df)
summary(m1.zelig)
# Simulating z = 10
s1 <- setx(m1.zelig, z = 10)
simulation <- sim(m1.zelig, x = s1)
summary(simulation)
So Zelig keeps x at its mean (20.56), and simulates the quantity of interest with z = 10. In this case, y is approximately 71.
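That expected value can be checked by hand from an ordinary least-squares fit (a sketch that does not use Zelig):
b <- coef(lm(y ~ x + z, data = df))
unname(b[1] + b[2] * mean(df$x) + b[3] * 10) # roughly 71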
The same model using arm:
# Model
require(arm)
m1.arm <- bayesglm(y ~ x + z, data=df)
summary(m1.arm)
And using rstanarm:
# Model
require(rstanarm)
m1.stan <- stan_lm(y ~ x + z, data = df)
print(m1.stan)
Is there any way to simulate z = 10 and x equal to its mean with the posterior distribution estimated by those two packages and get the expected value of y? Thank you very much!
In the case of bayesglm, you could do
sims <- arm::sim(m1.arm, n = 1000)
y_sim <- rnorm(n = 1000, mean = sims@coef %*% t(as.matrix(s1)), sd = sims@sigma)
mean(y_sim)
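Note that s1 is used here as a one-row design matrix holding the covariate values to simulate at. If you are not carrying over Zelig's setx object, a hand-built equivalent would be (assuming the column order intercept, x, z):
s1 <- matrix(c(1, mean(df$x), 10), nrow = 1,
             dimnames = list(NULL, c("(Intercept)", "x", "z")))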
For the (unreleased) rstanarm, it would be similar
sims <- as.matrix(m1.stan)
y_sim <- rnorm(n = nrow(sims), mean = sims[, 1:(ncol(sims) - 1)] %*% t(as.matrix(s1)),
               sd = sims[, ncol(sims)])
mean(y_sim)
In general for Stan, you could pass s1 as a row_vector and utilize it in a generated quantities block of a .stan file like
generated quantities {
  real y_sim;
  y_sim <- normal_rng(s1 * beta, sigma);
}
in which case the posterior distribution of y_sim would appear when you print the posterior summary.
