Find Cook's Distance on Predicted Values for LM - r

Problem
I would like to use Cook's distance to identify outliers in my predicted data.
Background
I know it is easy to find the outliers in the original data used to build a linear model using cooks.distance() (illustrated in Example 1 below).
More Explanation of Problem
When I fit new data with that model (using predict()), I can't see how to get the Cook's distance for the new points, since cooks.distance() only operates on a model object. I understand that it is calculated by a leave-one-out method, iteratively rebuilding the model, so perhaps it doesn't make sense to calculate it on fitted values; still, I was hoping I'm missing something simple about how one might approach this.
Desired Output
In Example 2 below I show the predicted values for which I'd like to highlight outliers by their Cook's D; since I didn't know how to do that, I used their residuals to illustrate something close to my desired output.
Example 1
# subset data
a <- mtcars[1:16,]
# build model on one half
m <- lm(mpg ~ disp, a)
# find outliers
c <- cooks.distance(m)
# visualize outliers with cook's d
pal <- colorRampPalette(c("black", "red"))(102)
with(a,
     plot(mpg ~ disp,
          col = pal[1 + round(100 * scale(c, min(c), max(c)))],
          pch = 19,
          main = "Color by Cook's D")); abline(m)
Example 2
# predict on full data and add residuals
b <- mtcars
b$pred_mpg <- predict(m, mtcars)
b$resid <- b$mpg - b$pred_mpg
# visualize outliers in full data by residuals
with(b,
     plot(mpg ~ disp,
          pch = 19,
          col = pal[1 + round(100 * scale(resid, min(resid), max(resid)))],
          main = "Color by Residual")); abline(m)
Created on 2022-03-10 by the reprex package (v2.0.1)
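One possible workaround, sketched under the caveat that it is not the leave-one-out quantity cooks.distance() computes: Cook's distance combines a point's squared residual with its leverage, and both ingredients can be assembled by hand for new points from the training fit, using the out-of-sample leverage h0 = x0' (X'X)^-1 x0. Treat the result as a heuristic influence score rather than a true Cook's distance.
# Heuristic "pseudo Cook's D" for the new points in b, reusing model m from
# Example 1. This plugs out-of-sample leverage and residuals into the usual
# formula D = e^2 / (p * s^2) * h / (1 - h)^2; it is NOT a leave-one-out value.
X    <- model.matrix(m)              # training design matrix
XtXi <- solve(crossprod(X))          # (X'X)^-1
x0   <- cbind(1, b$disp)             # design rows for the new data
h0   <- rowSums((x0 %*% XtXi) * x0)  # leverage of each new point
p    <- length(coef(m))              # number of parameters
s2   <- summary(m)$sigma^2           # residual variance from the training fit
d0   <- (b$resid^2 / (p * s2)) * h0 / (1 - h0)^2
# d0 can then replace resid in the colour scale of Example 2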

Related

Obtaining fitted values from second-order quantile regressions

I'm sure this is easily resolvable, but I have a question regarding quantile regression.
Say I have a data frame whose values follow the trend of a second-order polynomial curve, and I fit quantile regressions through different parts of the data:
## Data preparation
set.seed(5)
d <- data.frame(x = seq(-5, 5, len = 51))
d$y <- 50 - 0.3 * d$x^2 + rnorm(nrow(d))
## Quantile regression
library(quantreg)
Taus <- c(0.1, 0.5, 0.9)
QUA <- rq(y ~ 1 + x + I(x^2), tau = Taus, data = d)
plot(y ~ x, data = d)
for (k in 1:length(Taus)) {
  curve(QUA$coef[1, k] + QUA$coef[2, k] * x + QUA$coef[3, k] * x^2,
        lwd = 2, lty = 1, add = TRUE)
}
I can obtain the maximum y value through the predict.rq() function, as you can see in the following plot.
## Maximum prediction
Pred_df <- as.data.frame(predict.rq(QUA))
apply(Pred_df, 2, max)
So my question is how do I obtain the x-value which corresponds to the maximum y-value (i.e. the break in slope) for each quantile?
The package broom could be very useful here:
library(broom)
library(dplyr)
augment(QUA) %>%
  group_by(.tau) %>%
  filter(.fitted == max(.fitted))
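A small addition (a sketch reusing the QUA object above): each fitted curve is a parabola y = a + b*x + c*x^2, so the x at the peak is available in closed form as -b/(2c), which avoids the resolution limit of taking the maximum over fitted values.
# Closed-form vertex for each tau: y = a + b*x + c*x^2 peaks at x = -b/(2*c)
# (a maximum only when c < 0, as it is here)
apply(QUA$coef, 2, function(cf) -cf[2] / (2 * cf[3]))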

Add raw data points to jp.int (sjPlot)

For my manuscript, I plotted an lme fit with an interaction of two continuous variables:
Create data
mydata <- data.frame(
  SID  = sample(1:150, 400, replace = TRUE),
  age  = sample(50:70, 400, replace = TRUE),
  sex  = sample(c("Male", "Female"), 200, replace = TRUE),
  time = seq(0.7, 6.2, length.out = 400),
  Vol  = rnorm(400),
  HCD  = rnorm(400)
)
mydata$time <- as.numeric(mydata$time)
Run the model:
library(nlme)
model <- lme(HCD ~ age*time + sex*time + Vol*time, random = ~time|SID, data = mydata)
Make plot:
sjp.int(model, swap.pred = TRUE, show.ci = TRUE, mdrt.values = "meansd")
The reviewer now wants me to add the raw data points to this plot. How can I do this? I tried adding geom_point() referring to mydata, but that did not work.
Any ideas?
Update:
I thought that maybe I could extract the random slopes for HCD, residualize HCD and Vol for the covariates, and plot those two against each other to make things easier (then I could show the points in a 2D plot).
So I tried to extract the slopes and use them to fit a linear regression, but the results differ: in the reproducible example they are less significant, and in my real data the interaction became non-significant (it was significant in the lme). I'm not sure what that means, or whether it just shows that I should not try to plot it this way.
Get the slopes:
model <- lme(HCD ~ time, random = ~time|SID, data = mydata)
slopes2 <- data.frame(SID = as.numeric(row.names(model$coefficients$random$SID)),
                      slopes = model$coefficients$random$SID[, 2])
(Building the data frame directly like this avoids the character coercion I ran into when assembling it via rbind(), which forced a save-and-reopen workaround.)
Then create a cross-sectional data frame and merge the slopes:
mydata$time2 <- round(mydata$time)
new <- reshape(mydata, idvar = "SID", timevar = "time2", direction = "wide")
newdata <- dplyr::left_join(new, slopes2, by = "SID")
The lm:
modelw <- lm(slopes ~ age.1 + sex.1 + Vol.1, data = newdata)
Vol now has a p-value of 0.8 (previously this was 0.14)
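One hedged way to satisfy the reviewer without going through sjp.int at all (a sketch of my own devising, not sjPlot functionality): rebuild the interaction plot manually with ggplot2, predicting HCD from the fixed effects only (level = 0 in nlme's predict) over a grid of Vol at mean +/- SD of time, mimicking mdrt.values = "meansd". Here age and sex are held at arbitrary reference values, and the raw points are layered on with geom_point().
library(nlme)
library(ggplot2)
# refit the full interaction model from above (the update overwrote `model`)
model_full <- lme(HCD ~ age*time + sex*time + Vol*time,
                  random = ~time|SID, data = mydata)
tvals <- mean(mydata$time) + c(-1, 0, 1) * sd(mydata$time)
grid <- expand.grid(
  Vol  = seq(min(mydata$Vol), max(mydata$Vol), length.out = 50),
  time = tvals,
  age  = mean(mydata$age),   # covariates held at reference values
  sex  = "Female"
)
grid$HCD <- predict(model_full, newdata = grid, level = 0)  # fixed effects only
ggplot(grid, aes(Vol, HCD, colour = factor(round(time, 2)))) +
  geom_line() +
  geom_point(data = mydata, aes(Vol, HCD), colour = "grey50", alpha = 0.5) +
  labs(colour = "time")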

plot logistic regression line over heat plot

My data are binary, with two continuous independent variables. For both predictors, as they get bigger there are more positive responses. I have plotted the data in a heatplot showing the density of positive responses along the two variables: the most positive responses are in the top right corner and the most negative in the bottom left, with a gradient change visible along both axes.
I would like to plot a line on the heatplot showing where a logistic regression model predicts that positive and negative responses are equally likely. (My model is of the form response~predictor1*predictor2+(1|participant).)
My question: How can I figure out the line based on this model at which the positive response rate is 0.5?
I tried using predict(), but that works the opposite way round: I have to give it predictor values rather than the response rate I want. I also tried a function I had used before with only one predictor (function(x) (log(x/(1-x)) - fixef(fit)[1]) / fixef(fit)[2]), but it only returns single values, not a line, and only handles one predictor at a time.
Using a simple example logistic regression model fitted to the mtcars dataset, and the algebra described here, I can produce a heatmap with a decision boundary using:
library(ggplot2)
library(tidyverse)
data("mtcars")
m1 = glm(am ~ hp + wt, data = mtcars, family = binomial)
# Generate combinations of hp and wt across their observed ranges. Only
# generating 50 values of each here, which is not a lot, but since each
# combination is included you get 50 x 50 rows
pred_df = expand.grid(
  hp = seq(min(mtcars$hp), max(mtcars$hp), length.out = 50),
  wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 50)
)
pred_df$pred_p = predict(m1, pred_df, type = "response")
# For a given value of hp (predictor1), find the value of
# wt (predictor2) that gives predicted p = 0.5, i.e. solve
# b0 + b1*hp + b2*wt = 0 for wt
find_boundary = function(hp_val, coefs) {
  beta_0 = coefs['(Intercept)']
  beta_1 = coefs['hp']
  beta_2 = coefs['wt']
  (-beta_0 - beta_1 * hp_val) / beta_2
}
# Find the boundary value of wt for each of the 50 values of hp.
# Using the algebra in the linked question you can instead find
# the slope and intercept of the boundary, so you could potentially
# skip this step
boundary_df = pred_df %>%
  select(hp) %>%
  distinct() %>%
  mutate(wt = find_boundary(hp, coef(m1)))
ggplot(pred_df, aes(x = hp, y = wt)) +
  geom_tile(aes(fill = pred_p)) +
  geom_line(data = boundary_df)
Producing a heatmap of predicted probability with the p = 0.5 decision boundary drawn as a line.
Note that this only takes into account the fixed effects from the model, so if you want to somehow take into account random effects this could be more complex.
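For the asker's actual model, which includes an interaction (response ~ predictor1*predictor2 + (1|participant)), a hedged extension: setting the linear predictor to zero gives b0 + b1*x1 + b2*x2 + b3*x1*x2 = 0, hence x2 = -(b0 + b1*x1) / (b2 + b3*x1), so the boundary is a curve rather than a straight line; with an lme4 fit the population-level coefficients come from fixef() rather than coef().
# Sketch only; `fit` is a placeholder for a glmer fit of the form
# response ~ x1 * x2 + (1 | participant)
find_boundary_int = function(x1_val, b) {
  # p = 0.5 where b[1] + b[2]*x1 + b[3]*x2 + b[4]*x1*x2 = 0
  -(b[1] + b[2] * x1_val) / (b[3] + b[4] * x1_val)
}
# hypothetical usage: boundary_x2 = find_boundary_int(x1_grid, lme4::fixef(fit))
Alternatively, since pred_df already holds predicted probabilities on a grid, geom_contour(aes(z = pred_p), breaks = 0.5) draws the p = 0.5 boundary directly, which also works when the boundary is curved.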

Predict Future values using polynomial regression in R

I was trying to predict future values of a sample using polynomial regression in R. The y values within the sample form a wave pattern.
For example
x = 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
y= 1,2,3,4,5,4,3,2,1,0,1,2,3,4,5,4
But when the graph was plotted for future values, the resulting y values were completely different from what was expected: instead of a wave pattern, I got a graph where the y values keep increasing.
futurY = 17,18,19,20,21,22
I tried different degrees of polynomial regression, but the predictions for futurY were drastically different from what was expected.
Following is the sample R code that was used to get the results:
dfram <- data.frame('x' = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16))
dfram$y <- c(1,2,3,4,5,4,3,2,1,0,1,2,3,4,5,4)
plot(dfram$x, dfram$y, type = "l", lwd = 3)
pred <- data.frame('x' = c(17,18,19,20,21,22))
myFit <- lm(y ~ poly(x, 5), data = dfram)
newdata <- predict(myFit, pred)
print(newdata)
plot(pred[, 1], newdata, type = "l", col = "red", lwd = 3)
Is this the correct technique for predicting the unknown future y values, or should I be using other techniques like forecasting?
# Reproducing your data frame
dfram <- data.frame("x" = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16),
"y" = c(1,2,3,4,5,4,3,2,1,0,1,2,3,4,5,4))
From your graph I read off the phase and period of the signal by eye; there are better ways of estimating these automatically.
# Phase and period
fase = 1
per = 10
In the linear model call I've encoded the triangular-signal equations:
fit <- lm(y ~ I(((trunc((x - fase)/(per/2)) %% 2) * 2 - 1) * ((x - fase) %% (per/2)))
            + I(((trunc((x - fase)/(per/2)) %% 2) * 2 - 1) * ((per/2) - ((x - fase) %% (per/2)))),
          data = dfram)
# Predict the old data
p_olddata <- predict(fit, type = "response")
# Predict the new data
newdata <- data.frame('x' = c(17,18,19,20,21,22))
p_newdata <- predict(fit, newdata, type = "response")
# Plot old and new data
plot(x = c(dfram$x, newdata$x),
     y = c(p_olddata, p_newdata),
     col = c(rep("blue", length(p_olddata)), rep("green", length(p_newdata))),
     xlab = "x",
     ylab = "y")
lines(dfram)
Where the black line is the original signal, the blue circles are the prediction for the original points and the green circles are the prediction for the new data.
The graph shows a perfect fit for the model because there is no noise in the data; a real dataset will have noise, so the fit will not look as clean as this.
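As an alternative sketch (my addition, assuming the same period of 10 as above): a harmonic regression on sine and cosine terms also extrapolates periodically, without hand-coding the triangular shape, and degrades more gracefully when the wave is not exactly triangular.
# Harmonic (Fourier) regression with two harmonics at period 10
per <- 10
fit_harm <- lm(y ~ sin(2*pi*x/per) + cos(2*pi*x/per)
                 + sin(4*pi*x/per) + cos(4*pi*x/per), data = dfram)
predict(fit_harm, data.frame(x = 17:22))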

How to add RMSE, slope, intercept, r^2 to R plot?

How can I add the RMSE, slope, intercept, and r^2 to a plot using R? I have attached a script with sample data in a similar format to my real dataset; unfortunately, I am at a standstill. Is there an easier way to add these statistics to the graph than creating an object for each equation and inserting it into text()? Ideally I'd like the statistics displayed stacked on the graph. How can I accomplish this?
## Generate Sample Data
x = c(2,4,6,8,9,4,5,7,8,9,10)
y = c(4,7,6,5,8,9,5,6,7,9,10)
# Create a dataframe to resemble existing data
mydata = data.frame(x,y)
#Plot the data
plot(mydata$x,mydata$y)
abline(fit <- lm(y~x))
# Calculate RMSE
model = sqrt(deviance(fit)/df.residual(fit))  # note: this is the residual standard error, sqrt(RSS/df), not sqrt(mean(resid^2))
# Add RMSE value to plot
text(3,9,model)
Here is a version using base graphics and ?plotmath to draw the plot and annotate it:
## Generate Sample Data
x = c(2,4,6,8,9,4,5,7,8,9,10)
y = c(4,7,6,5,8,9,5,6,7,9,10)
## Create a dataframe to resemble existing data
mydata = data.frame(x,y)
## fit model
fit <- lm(y~x, data = mydata)
Next, calculate the values you want to appear in the annotation. I prefer bquote() for this, where anything marked up as .(foo) is replaced by the value of the object foo. The answer @mnel points you to in the comments uses substitute() to achieve the same thing by different means. So I create an object in the workspace for each value you might wish to display in the annotation:
## Calculate RMSE and other values
rmse <- round(sqrt(mean(resid(fit)^2)), 2)
coefs <- coef(fit)
b0 <- round(coefs[1], 2)
b1 <- round(coefs[2],2)
r2 <- round(summary(fit)$r.squared, 2)
Now build up the equation using constructs described in ?plotmath:
eqn <- bquote(italic(y) == .(b0) + .(b1)*italic(x) * "," ~~
              r^2 == .(r2) * "," ~~ RMSE == .(rmse))
Once that is done, you can draw the plot and annotate it with your expression:
## Plot the data
plot(y ~ x, data = mydata)
abline(fit)
text(2, 10, eqn, pos = 4)
Which gives a scatterplot of the data with the fitted line and the equation, r^2, and RMSE annotated at the top left.
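Since the question asks for the statistics stacked on the graph, a small variation on the same idea (reusing the objects above; the y positions are chosen by eye for this data): draw each statistic with its own text() call.
plot(y ~ x, data = mydata)
abline(fit)
text(2, 10.0, bquote(italic(y) == .(b0) + .(b1) * italic(x)), pos = 4)
text(2, 9.5, bquote(r^2 == .(r2)), pos = 4)
text(2, 9.0, bquote(RMSE == .(rmse)), pos = 4)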
