How to add RMSE, slope, intercept, r^2 to R plot?

How can I add RMSE, slope, intercept and r^2 to a plot using R? I have attached a script with sample data, which is a similar format to my real dataset--unfortunately, I am at a stand-still. Is there an easier way to add these statistics to the graph than to create an object from an equation and insert that into text()? I would ideally like the statistics to be displayed stacked on the graph. How can I accomplish this?
## Generate Sample Data
x = c(2,4,6,8,9,4,5,7,8,9,10)
y = c(4,7,6,5,8,9,5,6,7,9,10)
# Create a dataframe to resemble existing data
mydata = data.frame(x,y)
#Plot the data
plot(mydata$x,mydata$y)
abline(fit <- lm(y~x))
# Calculate RMSE
model = sqrt(deviance(fit)/df.residual(fit))
# Add RMSE value to plot
text(3,9,model)

Here is a version using base graphics and ?plotmath to draw the plot and annotate it:
## Generate Sample Data
x = c(2,4,6,8,9,4,5,7,8,9,10)
y = c(4,7,6,5,8,9,5,6,7,9,10)
## Create a dataframe to resemble existing data
mydata = data.frame(x,y)
## fit model
fit <- lm(y~x, data = mydata)
Next, calculate the values you want to appear in the annotation. I prefer bquote() for this, where anything marked up in .(foo) will be replaced by the value of the object foo. The answer @mnel points you to in the comments uses substitute() to achieve the same thing by different means. So I create objects in the workspace for each value you might wish to display in the annotation:
## Calculate RMSE and other values
rmse <- round(sqrt(mean(resid(fit)^2)), 2)
coefs <- coef(fit)
b0 <- round(coefs[1], 2)
b1 <- round(coefs[2],2)
r2 <- round(summary(fit)$r.squared, 2)
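As a small aside on the bquote()/substitute() point above, the following sketch (not part of the original answer; the value 1.5 is made up for illustration) suggests that both functions build the same unevaluated call and differ only in how values are injected.

```r
# Illustrative value only; in the answer this would come from coef(fit)
b0 <- 1.5

# bquote(): anything wrapped in .() is replaced by its value
e1 <- bquote(italic(y) == .(b0))

# substitute(): symbols are replaced using the supplied list
e2 <- substitute(italic(y) == b0, list(b0 = b0))

# Both are unevaluated calls that plotmath can render
print(e1)
print(e2)
```

Either object can be passed to text() or legend() in the same way.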
Now build up the equation using constructs described in ?plotmath:
eqn <- bquote(italic(y) == .(b0) + .(b1)*italic(x) * "," ~~
r^2 == .(r2) * "," ~~ RMSE == .(rmse))
Once that is done you can draw the plot and annotate it with your expression
## Plot the data
plot(y ~ x, data = mydata)
abline(fit)
text(2, 10, eqn, pos = 4)
Which gives:
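The question asked for the statistics to be stacked rather than placed on one line. One possible way to do that, sketched here as an assumption rather than as part of the original answer, is to build one plotmath expression per statistic and hand them to legend(), which draws each entry on its own line. The object name stat_labels is made up for this example.

```r
## Same sample data and model as above
x <- c(2,4,6,8,9,4,5,7,8,9,10)
y <- c(4,7,6,5,8,9,5,6,7,9,10)
mydata <- data.frame(x, y)
fit <- lm(y ~ x, data = mydata)

## Values to display (unname() drops the coefficient names)
rmse <- round(sqrt(mean(resid(fit)^2)), 2)
b0   <- round(unname(coef(fit)[1]), 2)
b1   <- round(unname(coef(fit)[2]), 2)
r2   <- round(summary(fit)$r.squared, 2)

## One expression per line of the annotation
stat_labels <- as.expression(list(
  bquote(slope == .(b1)),
  bquote(intercept == .(b0)),
  bquote(r^2 == .(r2)),
  bquote(RMSE == .(rmse))
))

## Plot and add the stacked annotation without a box
plot(y ~ x, data = mydata)
abline(fit)
legend("topleft", legend = stat_labels, bty = "n")
```

Using legend() avoids computing the vertical offset for each text() call by hand; the position keyword ("topleft" here) can be swapped for explicit coordinates.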

Related

Find Cook's Distance on Predicted Values for LM

Problem
I would like to use Cook's distance to identify outliers in my predicted data.
Background
I know it is easy to find the outliers in the original data used to build a linear model using cooks.distance() (illustrated in Example 1 below).
More Explanation of Problem
When I fit new data with that model (using predict()), I can't see how to get the Cook's distance for the new points, since cooks.distance() only operates on a model object. I understand that it is calculated by a leave-one-out method that iteratively rebuilds the model, so perhaps it doesn't make sense to calculate it on predicted values, but I was hoping that I'm missing something simple about how one might approach this.
Desired Output
In Example 2 below I show the predicted values in which I'd like to highlight outliers by their Cook's D. Since I didn't know how to do that, I just used their residuals to illustrate something close to my desired output.
Example 1
# subset data
a <- mtcars[1:16,]
# build model on one half
m <- lm(mpg ~ disp, a)
# find outliers
c <- cooks.distance(m)
# visualize outliers with cook's d
pal <- colorRampPalette(c("black", "red"))(102)
with(a,
plot(mpg ~ disp,
col = pal[1 + round(100 * scale(c, min(c), max(c)))],
pch = 19,
main = "Color by Cook's D")); abline(m)
Example 2
# predict on full data and add residuals
b <- mtcars
b$pred_mpg <- predict(m, mtcars)
b$resid <- b$mpg - b$pred_mpg
# visualize outliers in full data by residuals
with(b,
plot(mpg ~ disp,
pch = 19,
col = pal[1 + round(100 * (resid - min(resid)) / (max(resid) - min(resid)))], # rescale residuals to 0..1 so the index stays within pal
main = "Color by Residual")); abline(m)
Created on 2022-03-10 by the reprex package (v2.0.1)
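No answer is recorded here, so the following is only a sketch of one possible reading of the problem, not an established definition of Cook's distance for out-of-sample points. It assumes that the influence of a new point can be judged by appending it to the training data, refitting, and reading off that point's Cook's distance. The helper cooks_for_new_point() is hypothetical.

```r
# Training half and the "new" points, as in the examples above
a <- mtcars[1:16, ]
new_pts <- mtcars[17:32, ]

# Hypothetical helper: Cook's distance of a single new row
# after appending it to the training data and refitting
cooks_for_new_point <- function(train, new_row, formula) {
  augmented <- rbind(train, new_row)
  refit <- lm(formula, data = augmented)
  unname(cooks.distance(refit)[nrow(augmented)])
}

# One refit per new point
new_d <- vapply(
  seq_len(nrow(new_pts)),
  function(i) cooks_for_new_point(a, new_pts[i, ], mpg ~ disp),
  numeric(1)
)
names(new_d) <- rownames(new_pts)

# Largest values first
head(sort(new_d, decreasing = TRUE))
```

Each value answers "how much would the fit move if this point were added", which is close in spirit to, but not the same as, cooks.distance() on the original model.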

Predict Future values using polynomial regression in R

I was trying to predict the future value of a sample using polynomial regression in R. The y values within the sample form a wave pattern.
For example
x = 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
y= 1,2,3,4,5,4,3,2,1,0,1,2,3,4,5,4
But when the graph is plotted for future values, the resulting y values were completely different from what was expected. Instead of a wave pattern, I was getting a graph where the y values keep increasing.
futurY = 17,18,19,20,21,22
I tried different degrees of polynomial regression, but the predicted results for futurY were drastically different from what was expected.
Following is the sample R code which was used to get the results
dfram <- data.frame('x'=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16))
dfram$y <- c(1,2,3,4,5,4,3,2,1,0,1,2,3,4,5,4)
plot(dfram$x,dfram$y,type="l", lwd=3)
pred <- data.frame('x'=c(17,18,19,20,21,22))
myFit <- lm(y ~ poly(x,5), data=dfram)
newdata <- predict(myFit, pred)
print(newdata)
plot(pred[,1],data.frame(newdata)[,1],type="l",col="red", lwd=3)
Is this the correct technique to be used for predicting the unknown future y values OR should I be using other techniques like forecasting?
# Reproducing your data frame
dfram <- data.frame("x" = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16),
"y" = c(1,2,3,4,5,4,3,2,1,0,1,2,3,4,5,4))
From your graph I've got the phase and period of the signal. There're better ways of calculating that automatically.
# Phase and period
fase = 1
per = 10
In the linear model function I've put the triangular signal equations.
fit <- lm(y ~ I((((trunc((x-fase)/(per/2))%%2)*2)-1) * (x-fase)%%(per/2))
+ I((((trunc((x-fase)/(per/2))%%2)*2)-1) * ((per/2)-((x-fase)%%(per/2))))
,data=dfram)
# Predict the old data
p_olddata <- predict(fit,type="response")
# Predict the new data
newdata <- data.frame('x'=c(17,18,19,20,21,22))
p_newdata <- predict(fit,newdata,type="response")
# Plotting old and new data
plot(x=c(dfram$x,newdata$x),
y=c(p_olddata,p_newdata),
col=c(rep("blue",length(p_olddata)),rep("green",length(p_newdata))),
xlab="x",
ylab="y")
lines(dfram)
Where the black line is the original signal, the blue circles are the prediction for the original points and the green circles are the prediction for the new data.
The graph shows a perfect fit for the model because there's no noise in the data. A real dataset will usually contain noise, so the fit will not look as nice as this.
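As a rough alternative to the hand-written triangular terms, a periodic basis can be swapped in. The sketch below is not the answerer's method: it assumes the period of 10 read off the graph above and fits a single sine/cosine pair, which only approximates a triangular wave. It is shown next to the degree-5 polynomial to illustrate why a periodic basis keeps extrapolated values within the range of the signal while a polynomial need not.

```r
dfram <- data.frame(x = 1:16,
                    y = c(1,2,3,4,5,4,3,2,1,0,1,2,3,4,5,4))
per <- 10  # period taken from the answer above (an assumption about the signal)

# Harmonic regression: one sine/cosine pair at the assumed period
fit_h <- lm(y ~ sin(2 * pi * x / per) + cos(2 * pi * x / per), data = dfram)

# The polynomial fit from the question, for comparison
fit_p <- lm(y ~ poly(x, 5), data = dfram)

newdata <- data.frame(x = 17:22)
pred_h <- predict(fit_h, newdata)
pred_p <- predict(fit_p, newdata)

# Side-by-side extrapolations
print(data.frame(x = newdata$x, harmonic = pred_h, polynomial = pred_p))
```

Adding higher harmonics (for example at per/3) would bring the fit closer to the triangular shape, at the cost of more parameters.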

Predict out of sample using flexsurvreg in R

I have the following model in R
library(flexsurv)
data(ovarian)
model = flexsurvreg(Surv(futime, fustat) ~ ecog.ps + rx, data = ovarian, dist='weibull')
model
predict(model,data = ovarian, type = 'response')
The model summary looks like this: [image: flexsurvreg model output]
I am trying to predict the survival time using the predict function in R and get the following error:
[image: error while trying to predict]
How can I predict expected lifetime using this flexsurvreg model?
I understand that the documentation mentions a totlos.fs function, but this data does not seem to have a trans variable that totlos.fs requires to provide an output.
If there is no other alternative to totlos.fs how can I create a trans variable in this data and handle it along with existing covariates?
Please advise.
Section 3 of the supplementary examples doc for the flexsurv documentation has an example in which the predicted values are calculated directly using the model equation. As you are using the Weibull distribution (with n=2 parameters) I believe this should work:
n <- 2 # number of Weibull distribution parameters (shape and scale)
pred.model <- model.matrix(model) %*% model$res[-(1:n),"est"]
Cheers
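To make the "use the model equation directly" idea concrete, here is a sketch that swaps in survival::survreg() instead of flexsurvreg(), since both fit a Weibull accelerated failure time model to these data. This is an illustration under that assumption, not the flexsurv API. In survreg()'s parameterisation the Weibull scale is exp(linear predictor) and the Weibull shape is 1 / fit$scale, so the expected lifetime is scale * gamma(1 + 1/shape).

```r
library(survival)

# Weibull AFT model on the same data and covariates as the question
fit <- survreg(Surv(futime, fustat) ~ ecog.ps + rx,
               data = ovarian, dist = "weibull")

# Linear predictor on the log-time scale for each patient
lp <- predict(fit, newdata = ovarian, type = "lp")

# Convert to the usual Weibull parameters
wb_scale <- exp(lp)        # per-patient scale
wb_shape <- 1 / fit$scale  # common shape

# Mean of a Weibull distribution: scale * gamma(1 + 1/shape)
expected_life <- wb_scale * gamma(1 + 1 / wb_shape)
head(expected_life)
```

Note that predict(fit, type = "response") returns wb_scale, not the mean, so the gamma() factor is needed to get the expected lifetime.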
Nik,
I know your question is an old one, but see below how I hacked a way to do it. It involves retrieving the shape and rate parameters from your fit of test data, then instead of predict, you use the qgompertz() from flexsurv. Please excuse the use of my own encapsulated example code, but you should be able to follow along.
# generate the training data "lung1" from data(lung) in survival package
# hacked way for truncating the lung data to 2 years of follow up
require(survival)
lung$yrs <- lung$time/365
lung1 <- lung[c("status", "yrs")]
lung1$status[ lung1$yrs >2] <- 1
lung1$yrs[ lung1$yrs >2] <- 2
# from the training data build KM to obtain survival %s
s <- Surv(time=lung1$yrs, event=lung1$status)
km.lung <- survfit(s ~ 1, data=lung1)
plot(km.lung)
# generate dataframe to use later for plotting
cut.length <- sum((km.lung$time <= 2)) # so I can create example test data
test.data <- data.frame(yrs = km.lung$time[1:cut.length] , surv=round(km.lung$surv[1:cut.length], 3))
##
## doing the same as above with gompertz
##
require(flexsurv) #needed to run gompertz model
s <- Surv(time=lung1$yrs, event=lung1$status)
gomp <- flexsurvreg(s ~ 1, data=lung1, dist="gompertz") # run this to get shape and rate estimates for gompertz
gomp # notice the shape and rate values
# create variables for these values
g.shape <- 0.5866
g.rate <- 0.5816
##
## plot data and visualize the Gompertz fit
##
# vars for plotting
df1 <- test.data
xvar <- "yrs"
yvar <- "surv"
extendedtime <- 3 #
ylim1 <- c(0,1)
xlim1 <- c(0, extendedtime)
# plot the survival % for training data
plot(df1[,yvar]~df1[,xvar], type="S", ylab="", xlab="", lwd=3, xlim=xlim1, ylim=ylim1)
# Nik--here is where the magic happens... pay special attention to: qgompertz(seq(.01,.99,by=.01), shape=g.shape, rate=g.rate)
lines(qgompertz(seq(.01,.99,by=.01), shape=g.shape, rate=g.rate), seq(.99,.01,by=-.01), col="red", lwd=2, lty=2)
# generate a km curve from the testing data
s <- Surv(time=lung$yrs, event=lung$status)
km.lung <- survfit(s ~ 1, data=lung)
par(new=T)
# now draw remaining survival curve from the testing section
plot(km.lung$surv[(cut.length+1):length(km.lung$time)]~km.lung$time[(cut.length+1):length(km.lung$time)], type="S", col="blue", ylab="", xlab="", lwd=3, xlim=xlim1, ylim=ylim1)
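For readers wondering what the qgompertz() call above computes, the sketch below writes out the Gompertz survival and quantile functions in base R. It assumes the parameterisation with hazard rate * exp(shape * t), which is my understanding of the one flexsurv uses, and it only covers shape > 0. The function names gompertz_surv() and gompertz_quantile() are made up for this example.

```r
# Survival function: S(t) = exp(-(rate/shape) * (exp(shape * t) - 1))
gompertz_surv <- function(t, shape, rate) {
  exp(-(rate / shape) * (exp(shape * t) - 1))
}

# Quantile function: the time t at which 1 - S(t) = p
gompertz_quantile <- function(p, shape, rate) {
  log(1 - shape * log(1 - p) / rate) / shape
}

# Shape and rate estimates quoted in the answer above
g.shape <- 0.5866
g.rate  <- 0.5816

p <- seq(.01, .99, by = .01)
times <- gompertz_quantile(p, g.shape, g.rate)

# Plotting times against 1 - p traces the fitted survival curve,
# which is what the lines() call in the answer does with qgompertz()
plot(times, 1 - p, type = "l", xlab = "yrs", ylab = "surv")
```

Because the curve is a closed-form function of time, it can be evaluated past the 2-year cutoff of the training data, which is how the answer extrapolates.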

Modifying a curve to prevent singular gradient matrix at initial parameter estimates

I want to use y=a^(b^x) to fit the data below,
y <- c(1.0385, 1.0195, 1.0176, 1.0100, 1.0090, 1.0079, 1.0068, 1.0099, 1.0038)
x <- c(3,4,5,6,7,8,9,10,11)
data <- data.frame(x,y)
When I use the non-linear least squares procedure,
f <- function(x,a,b) {a^(b^x)}
(m <- nls(y ~ f(x,a,b), data = data, start = c(a=1, b=0.5)))
it produces an error: singular gradient matrix at initial parameter estimates. The result is roughly a = 1.1466, b = 0.6415, so there shouldn't be a problem with initial parameter estimates as I have defined them as a=1, b=0.5.
I have read in other topics that it is convenient to modify the curve. I was thinking about something like log y=log a *(b^x), but I don't know how to deal with function specification. Any idea?
I will expand my comment into an answer.
If I use the following:
y <- c(1.0385, 1.0195, 1.0176, 1.0100, 1.0090, 1.0079, 1.0068, 1.0099, 1.0038)
x <- c(3,4,5,6,7,8,9,10,11)
data <- data.frame(x,y)
f <- function(x,a,b) {a^b^x}
(m <- nls(y ~ f(x,a,b), data = data, start = c(a=0.9, b=0.6)))
or
(m <- nls(y ~ f(x,a,b), data = data, start = c(a=1.2, b=0.4)))
I obtain:
Nonlinear regression model
model: y ~ f(x, a, b)
data: data
a b
1.0934 0.7242
residual sum-of-squares: 0.0001006
Number of iterations to convergence: 10
Achieved convergence tolerance: 3.301e-06
I always obtain an error if I use 1 as a starting value for a, perhaps because 1 raised to anything is 1.
As for automatically generating starting values, I am not familiar with a procedure to do that. One method I have read about is to simulate curves and use starting values that generate a curve that appears to approximate your data.
Here is the plot generated using the above parameter estimates using the following code. I admit that maybe the lower right portion of the line could fit a little better:
setwd('c:/users/mmiller21/simple R programs/')
jpeg(filename = "nlr.plot.jpeg")
plot(x,y)
curve(1.0934^(0.7242^x), from=0, to=11, add=TRUE)
dev.off()
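On the question of generating starting values automatically, one possible approach (an assumption on my part, not something from the answer above) builds on the linearisation idea raised in the question. Since every y here is greater than 1, taking logs twice gives log(log(y)) = log(log(a)) + x * log(b), which is linear in x. An ordinary lm() fit of that form yields rough values of a and b that can be passed to nls().

```r
y <- c(1.0385, 1.0195, 1.0176, 1.0100, 1.0090, 1.0079, 1.0068, 1.0099, 1.0038)
x <- c(3,4,5,6,7,8,9,10,11)
dat <- data.frame(x, y)

# Linearised model: valid only because all y > 1, so log(y) > 0
lin <- lm(log(log(y)) ~ x, data = dat)

# Back-transform the intercept and slope into starting values
start <- list(a = exp(exp(unname(coef(lin)[1]))),
              b = exp(unname(coef(lin)[2])))
print(start)

# Use them as starting values for the original non-linear fit
m <- nls(y ~ a^(b^x), data = dat, start = start)
print(coef(m))
```

By construction the starting value for a is strictly greater than 1, which sidesteps the a = 1 case discussed above.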

project a linear regression hyper plane to a 2d plot (abline-like)

I have this code
library(tseries) # provides get.hist.quote()
factors<-read.csv("India_Factors.csv",header=TRUE)
marketfactor<-factors[,4]
sizefactor<-factors[,5]
valuefactor<-factors[,6]
dati<-get.hist.quote("SI", quote = "AdjClose", compression = "m")
returns<-diff(dati)
regression<-lm(returns ~ marketfactor + sizefactor + valuefactor,na.action=na.omit)
that does multilinear regression.
I want to plot on a 2D plane the returns against a factor (and this is trivial of course) with superimposed the projection of the linear regression hyperplane for the specific factor. To be more clear the result should be like this: wolfram demonstrations (see the snapshots).
Any help will be greatly appreciated.
Thank you for your time and have a nice week end.
Giorgio.
The points in my comment notwithstanding, here is the canonical way to generate output from a fitted model in R for combinations of predictors. It really isn't clear what the plots you want are showing, but the ones that make sense to me are partial plots, where one variable is varied over its range whilst holding the others at some common value. Here I use the sample mean when holding a variable constant.
First some dummy data, with only two covariates, but this extends to any number:
set.seed(1)
dat <- data.frame(y = rnorm(100))
dat <- transform(dat,
x1 = 0.2 + (0.4 * y) + rnorm(100),
x2 = 2.4 + (2.3 * y) + rnorm(100))
Fit the regression model
mod <- lm(y ~ x1 + x2, data = dat)
Next some data values to predict at using the model. You could do all variables in a single prediction and then subset the resulting object to plot only the relevant rows. Alternatively, more clearly (though more verbose), you can deal with each variable separately. Below I create two data frames, one per covariate in the model. In a data frame I generate 100 values over the range of the covariate being varied, and repeat the mean value of the other covariate(s).
pdatx1 <- with(dat, data.frame(x1 = seq(min(x1), max(x1), length = 100),
x2 = rep(mean(x2), 100)))
pdatx2 <- with(dat, data.frame(x1 = rep(mean(x1), 100),
x2 = seq(min(x2), max(x2), length = 100)))
In the linear regression with straight lines, you really don't need 100 values --- the two end points of the range of the covariate will do. However for models where the fitted function is not linear you need to predict at more locations.
Next, use the model to predict at these data points
pdatx1 <- transform(pdatx1, yhat = predict(mod, pdatx1))
pdatx2 <- transform(pdatx2, yhat = predict(mod, pdatx2))
Now we are ready to draw the partial plots. First compute a range for the y axis - again it is mostly redundant here but if you are adding confidence intervals you will need to include their values below,
ylim <- range(pdatx1$yhat, pdatx2$yhat, dat$y)
To plot (here putting two figures on the same plot device) we can use the following code
layout(matrix(1:2, ncol = 2))
plot(y ~ x1, data = dat, ylim = ylim)
lines(yhat ~ x1, data = pdatx1, col = "red", lwd = 2)
plot(y ~ x2, data = dat, ylim = ylim)
lines(yhat ~ x2, data = pdatx2, col = "red", lwd = 2)
layout(1)
Which produces
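The answer mentions that confidence intervals, if added, need to be included when computing the y-axis range. A minimal sketch of that step for the x1 panel follows, assuming a 95% confidence interval from predict() with interval = "confidence"; the column names fit, lwr and upr are the ones predict() returns.

```r
# Same dummy data and model as above
set.seed(1)
dat <- data.frame(y = rnorm(100))
dat <- transform(dat,
                 x1 = 0.2 + (0.4 * y) + rnorm(100),
                 x2 = 2.4 + (2.3 * y) + rnorm(100))
mod <- lm(y ~ x1 + x2, data = dat)

# Vary x1 over its range, hold x2 at its mean
pdatx1 <- with(dat, data.frame(x1 = seq(min(x1), max(x1), length.out = 100),
                               x2 = mean(x2)))

# predict() returns a matrix with columns fit, lwr and upr
ci <- predict(mod, pdatx1, interval = "confidence", level = 0.95)
pdatx1 <- cbind(pdatx1, ci)

# Include the interval limits when computing the axis range
ylim <- range(pdatx1$lwr, pdatx1$upr, dat$y)

plot(y ~ x1, data = dat, ylim = ylim)
lines(fit ~ x1, data = pdatx1, col = "red", lwd = 2)
lines(lwr ~ x1, data = pdatx1, col = "red", lty = 2)
lines(upr ~ x1, data = pdatx1, col = "red", lty = 2)
```

The same pattern applies to the x2 panel by swapping which covariate is varied and which is held at its mean.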
