I'm evaluating the performance of a numerical deterministic model, and I'm evaluating its predictive performance against observed data. I made a scatter plot of the observed (Vsurface) vs modeled (Vmod) data, fit a lm (the red line), and added a 1:1 line. I want to find the point where these two lines intersect so I can document where the model shifts from over-predicting to under-predicting. Is there an easy way to do this? Here is the code for the lm:
lm <- lm(Vmod~Vsurface, data = v)
summary(lm)
Call:
lm(formula = Vmod ~ Vsurface, data = v)
Residuals:
Min 1Q Median 3Q Max
-0.63267 -0.11995 -0.03618 0.13816 0.60314
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.20666 0.06087 3.395 0.00185 **
Vsurface 0.43721 0.06415 6.816 1.05e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2232 on 32 degrees of freedom
Multiple R-squared: 0.5921, Adjusted R-squared: 0.5794
F-statistic: 46.45 on 1 and 32 DF, p-value: 1.047e-07
Here is the plot code:
ggplot(data = v, aes(x = Vsurface, y = Vmod)) +
geom_point(col = "slateblue2") +
geom_smooth(method = "lm", col = "red") +
geom_abline(intercept = 0, slope = 1)
I'm working in R markdown.
Just to explicitly state G5W's comment - the model is a list and the co-efficients can be extracted like this:
lmodel <- lm(Vmod~Vsurface, data = v)
x1 <- lmodel$coefficients[1]/(1-lmodel$coefficients[2])
### x1 is the intersection point
Edited step by step:
x <- rnorm(100,10,2)
y <- rnorm(100,15,3)
lmodel <- lm(y ~x)
lmodel$coefficients
Intercept = 13.6578378
x = 0.1283835
x1 <- lmodel$coefficients[1]/(1 - lmodel$coefficients[2])
x1
15.66955
Related
I am looking to use my PC1 from a PCA in a hierarchical regression analysis to account for additional variation in R. Is this possible?
I ran my pca with the code below in R
pca<- prcomp(my.data[,c(57:62)], center = TRUE,scale. = TRUE)
summary(pca)
str(pca)
fviz_eig(pca)
fviz_pca_ind(pca,
col.ind = "cos2", # Color by the quality of representation
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
ggbiplot(pca)
print(pca)
#some results!
Rotation (n x k) = (4 x 4):
PC1 PC2 PC3
EC 0.5389823 -0.4785188 0.0003197419
temp 0.4787782 0.3590390 0.7913858440
pH 0.5495125 -0.3839466 -0.2673991595
DO. 0.4222624 0.7033461 -0.5497326925
PC4
EC 0.6931938
temp -0.1247834
pH -0.6921840
DO. 0.1574569
Now I hope to use the PC1 as a variable in my models
Somthing like this
m0<- lm(Rel.abund.rotifers~turb+chl.a+PC1,data=my.data)
Any help is very appreciated!
Extract the component scores using pca$x, add them to your dataframe using cbind(), then run your model. Example using mtcars:
pca <- prcomp(mtcars[, 3:6])
mtcars2 <- cbind(mtcars, pca$x)
m0 <- lm(mpg ~ cyl + PC1, data = mtcars2)
summary(m0)
Call:
lm(formula = mpg ~ cyl + PC1, data = mtcars2)
Residuals:
Min 1Q Median 3Q Max
-4.1424 -2.0289 -0.7483 1.3613 6.9373
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.99508 4.80433 5.827 2.56e-06 ***
cyl -1.27749 0.77169 -1.655 0.1086
PC1 -0.02275 0.01010 -2.251 0.0321 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.008 on 29 degrees of freedom
Multiple R-squared: 0.7669, Adjusted R-squared: 0.7508
F-statistic: 47.71 on 2 and 29 DF, p-value: 6.742e-10
I am working with an interaction model similar to this one below:
set.seed(1993)
moderating <- sample(c("Yes", "No"),100, replace = T)
x <- sample(c("Yes", "No"), 100, replace = T)
y <- sample(1:100, 100, replace = T)
df <- data.frame(y, x, moderating)
Results <- lm(y ~ x*moderating)
summary(Results)
Call:
lm(formula = y ~ x * moderating)
Residuals:
Min 1Q Median 3Q Max
-57.857 -29.067 3.043 22.960 59.043
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 52.4000 6.1639 8.501 2.44e-13 ***
xYes 8.4571 9.1227 0.927 0.356
moderatingYes -11.4435 8.9045 -1.285 0.202
xYes:moderatingYes -0.1233 12.4563 -0.010 0.992
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 30.82 on 96 degrees of freedom
Multiple R-squared: 0.04685, Adjusted R-squared: 0.01707
F-statistic: 1.573 on 3 and 96 DF, p-value: 0.2009
I'm learning how to calculate the fitted value of a interaction from a regression table. In the example, the base category (or omitted category) is x= No and moderating = No.
Thus far, I know the following fitted values:
#Calulate Fitted Value From a Regression Interaction by hand
#Omitted Variable = X_no.M_no
X_no.M_no <- 52.4000
X_yes.M_no <- 52.4000 + 8.4571
X_no.M_yes <- 52.4000 + -11.4435
X_yes.M_yes #<- ?
I do not understand how the final category, X_yes.M_yes, is calculated. My initial thoughts were X_yes.M_yes <- 52.4000 + -0.1233, (the intercept plus the interaction term) but that is incorrect. I know its incorrect because, using the predict function, the fitted value of X_yes.M_yes = 49.29032, not 52.4000 + -0.1233 = 52.2767.
How do I calculate, by hand, the predicted value of the X_yes.M_yes category?
Here are the predicted values as generated from the predict function in R
#Validated Here Using the Predict Function:
newdat <- NULL
for(m in na.omit(unique(df$moderating))){
for(i in na.omit(unique(df$x))){
moderating <- m
x <- i
newdat<- rbind(newdat, data.frame(x, moderating))
}
}
Prediction.1 <- cbind(newdat, predict(Results, newdat, se.fit = TRUE))
Prediction.1
Your regression looks like this in math:
hat_y = a + b x + c m + d m x
Where x = 1 when "yes" and 0 when "no" and m is similarly defined by moderating.
Then X_yes.M_yes implies x = 1 and m = 1, so your prediction is a + b + c + d.
or in your notation X_yes.M_yes = 52.4000 + 8.4571 - 11.4435 - 0.1233
In order to correct heteroskedasticity in error terms, I am running the following weighted least squares regression in R :
#Call:
#lm(formula = a ~ q + q2 + b + c, data = mydata, weights = weighting)
#Weighted Residuals:
# Min 1Q Median 3Q Max
#-1.83779 -0.33226 0.02011 0.25135 1.48516
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) -3.939440 0.609991 -6.458 1.62e-09 ***
#q 0.175019 0.070101 2.497 0.013696 *
#q2 0.048790 0.005613 8.693 8.49e-15 ***
#b 0.473891 0.134918 3.512 0.000598 ***
#c 0.119551 0.125430 0.953 0.342167
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Residual standard error: 0.5096 on 140 degrees of freedom
#Multiple R-squared: 0.9639, Adjusted R-squared: 0.9628
#F-statistic: 933.6 on 4 and 140 DF, p-value: < 2.2e-16
Where "weighting" is a variable (function of the variable q) used for weighting the observations. q2 is simply q^2.
Now, to double-check my results, I manually weight my variables by creating new weighted variables :
mydata$a.wls <- mydata$a * mydata$weighting
mydata$q.wls <- mydata$q * mydata$weighting
mydata$q2.wls <- mydata$q2 * mydata$weighting
mydata$b.wls <- mydata$b * mydata$weighting
mydata$c.wls <- mydata$c * mydata$weighting
And run the following regression, without the weights option, and without a constant - since the constant is weighted, the column of 1 in the original predictor matrix should now equal the variable weighting:
Call:
lm(formula = a.wls ~ 0 + weighting + q.wls + q2.wls + b.wls + c.wls,
data = mydata)
#Residuals:
# Min 1Q Median 3Q Max
#-2.38404 -0.55784 0.01922 0.49838 2.62911
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#weighting -4.125559 0.579093 -7.124 5.05e-11 ***
#q.wls 0.217722 0.081851 2.660 0.008726 **
#q2.wls 0.045664 0.006229 7.330 1.67e-11 ***
#b.wls 0.466207 0.121429 3.839 0.000186 ***
#c.wls 0.133522 0.112641 1.185 0.237876
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Residual standard error: 0.915 on 140 degrees of freedom
#Multiple R-squared: 0.9823, Adjusted R-squared: 0.9817
#F-statistic: 1556 on 5 and 140 DF, p-value: < 2.2e-16
As you can see, the results are similar but not identical. Am I doing something wrong while manually weighting the variables, or does the option "weights" do something more than simply multiplying the variables by the weighting vector?
Provided you do manual weighting correctly, you won't see discrepancy.
So the correct way to go is:
X <- model.matrix(~ q + q2 + b + c, mydata) ## non-weighted model matrix (with intercept)
w <- mydata$weighting ## weights
rw <- sqrt(w) ## root weights
y <- mydata$a ## non-weighted response
X_tilde <- rw * X ## weighted model matrix (with intercept)
y_tilde <- rw * y ## weighted response
## remember to drop intercept when using formula
fit_by_wls <- lm(y ~ X - 1, weights = w)
fit_by_ols <- lm(y_tilde ~ X_tilde - 1)
Although it is generally recommended to use lm.fit and lm.wfit when passing in matrix directly:
matfit_by_wls <- lm.wfit(X, y, w)
matfit_by_ols <- lm.fit(X_tilde, y_tilde)
But when using these internal subroutines lm.fit and lm.wfit, it is required that all input are complete cases without NA, otherwise the underlying C routine stats:::C_Cdqrls will complain.
If you still want to use the formula interface rather than matrix, you can do the following:
## weight by square root of weights, not weights
mydata$root.weighting <- sqrt(mydata$weighting)
mydata$a.wls <- mydata$a * mydata$root.weighting
mydata$q.wls <- mydata$q * mydata$root.weighting
mydata$q2.wls <- mydata$q2 * mydata$root.weighting
mydata$b.wls <- mydata$b * mydata$root.weighting
mydata$c.wls <- mydata$c * mydata$root.weighting
fit_by_wls <- lm(formula = a ~ q + q2 + b + c, data = mydata, weights = weighting)
fit_by_ols <- lm(formula = a.wls ~ 0 + root.weighting + q.wls + q2.wls + b.wls + c.wls,
data = mydata)
Reproducible Example
Let's use R's built-in data set trees. Use head(trees) to inspect this dataset. There is no NA in this dataset. We aim to fit a model:
Height ~ Girth + Volume
with some random weights between 1 and 2:
set.seed(0); w <- runif(nrow(trees), 1, 2)
We fit this model via weighted regression, either by passing weights to lm, or manually transforming data and calling lm with no weigths:
X <- model.matrix(~ Girth + Volume, trees) ## non-weighted model matrix (with intercept)
rw <- sqrt(w) ## root weights
y <- trees$Height ## non-weighted response
X_tilde <- rw * X ## weighted model matrix (with intercept)
y_tilde <- rw * y ## weighted response
fit_by_wls <- lm(y ~ X - 1, weights = w)
#Call:
#lm(formula = y ~ X - 1, weights = w)
#Coefficients:
#X(Intercept) XGirth XVolume
# 83.2127 -1.8639 0.5843
fit_by_ols <- lm(y_tilde ~ X_tilde - 1)
#Call:
#lm(formula = y_tilde ~ X_tilde - 1)
#Coefficients:
#X_tilde(Intercept) X_tildeGirth X_tildeVolume
# 83.2127 -1.8639 0.5843
So indeed, we see identical results.
Alternatively, we can use lm.fit and lm.wfit:
matfit_by_wls <- lm.wfit(X, y, w)
matfit_by_ols <- lm.fit(X_tilde, y_tilde)
We can check coefficients by:
matfit_by_wls$coefficients
#(Intercept) Girth Volume
# 83.2127455 -1.8639351 0.5843191
matfit_by_ols$coefficients
#(Intercept) Girth Volume
# 83.2127455 -1.8639351 0.5843191
Again, results are the same.
I want to use the partial least squares regression to find the most representative variables to predict my data.
Here is my code:
library(pls)
potion<-read.table("potion-insomnie.txt",header=T)
potionTrain <- potion[1:182,]
potionTest <- potion[183:192,]
potion1 <- plsr(Sommeil ~ Aubepine + Bave + Poudre + Pavot, data = potionTrain, validation = "LOO")
The summary(lm(potion1)) give me this answer:
Call:
lm(formula = potion1)
Residuals:
Min 1Q Median 3Q Max
-14.9475 -5.3961 0.0056 5.2321 20.5847
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.63931 1.67955 22.410 < 2e-16 ***
Aubepine -0.28226 0.05195 -5.434 1.81e-07 ***
Bave -1.79894 0.26849 -6.700 2.68e-10 ***
Poudre 0.35420 0.72849 0.486 0.627
Pavot -0.47678 0.52027 -0.916 0.361
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.845 on 177 degrees of freedom
Multiple R-squared: 0.293, Adjusted R-squared: 0.277
F-statistic: 18.34 on 4 and 177 DF, p-value: 1.271e-12
I deduced that only the variables Aubepine et Bave are representative. So I redid the model just with this two variables:
potion1 <- plsr(Sommeil ~ Aubepine + Bave, data = potionTrain, validation = "LOO")
And I plot:
plot(potion1, ncomp = 2, asp = 1, line = TRUE)
Here is the plot of predicted vs measured values:
The problem is that I see the linear regression on the plot, but I can not know its equation and R². Is it possible ?
Is the first part is the same as a multiple regression linear (ANOVA)?
pacman::p_load(pls)
data(mtcars)
potion <- mtcars
potionTrain <- potion[1:28,]
potionTest <- potion[29:32,]
potion1 <- plsr(mpg ~ cyl + disp + hp + drat, data = potionTrain, validation = "LOO")
coef(potion1) # coefficeints
scores(potion1) # scores
## R^2:
R2(potion1, estimate = "train")
## cross-validated R^2:
R2(potion1)
## Both:
R2(potion1, estimate = "all")
I have some data that Excel will fit pretty nicely with a logarithmic trend. I want to pass the same data into R and have it tell me the coefficients and intercept. What form should have the data in and what function should I call to have it figure out the coefficients? Ultimately, I want to do this thousands of time so that I can project into the future.
Passing Excel these values produces this trendline function: y = -0.099ln(x) + 0.7521
Data:
y <- c(0.7521, 0.683478429, 0.643337383, 0.614856858, 0.592765647, 0.574715813,
0.559454895, 0.546235287, 0.534574767, 0.524144076, 0.514708368)
For context, the data points represent % of our user base that are retained on a given day.
The question omitted the value of x but working backwards it seems you were using 1, 2, 3, ... so try the following:
x <- 1:11
y <- c(0.7521, 0.683478429, 0.643337383, 0.614856858, 0.592765647,
0.574715813, 0.559454895, 0.546235287, 0.534574767, 0.524144076,
0.514708368)
fm <- lm(y ~ log(x))
giving:
> coef(fm)
(Intercept) log(x)
0.7521 -0.0990
and
plot(y ~ x, log = "x")
lines(fitted(fm) ~ x, col = "red")
You can get the same results by:
y <- c(0.7521, 0.683478429, 0.643337383, 0.614856858, 0.592765647, 0.574715813, 0.559454895, 0.546235287, 0.534574767, 0.524144076, 0.514708368)
t <- seq(along=y)
> summary(lm(y~log(t)))
Call:
lm(formula = y ~ log(t))
Residuals:
Min 1Q Median 3Q Max
-3.894e-10 -2.288e-10 -2.891e-11 1.620e-10 4.609e-10
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.521e-01 2.198e-10 3421942411 <2e-16 ***
log(t) -9.900e-02 1.261e-10 -784892428 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.972e-10 on 9 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 6.161e+17 on 1 and 9 DF, p-value: < 2.2e-16
For large projects I recommend to encapsulate the data into a data frame, like
df <- data.frame(y, t)
lm(formula = y ~ log(t), data=df)