Interpreting multiple-predictor polynomial regression output in R

I need to export the final equation from a polynomial regression with multiple predictors from R to another application, but I do not understand one part of the regression output. The regression uses the polym() function; a condensed version of the summary table is posted below.
ploy_lm <- lm(df$SV ~ polym(df$Indy, df$HI, degree = 3, raw = TRUE))
summary(ploy_lm)
In the table below, "(polym input)" stands for "polym(df$Indy, df$HI, degree = 3, raw = TRUE)".
Term                Estimate
(Intercept)         -8.903
(polym input)1.0     1.189E0
(polym input)2.0    -1.651E-2
(polym input)1.1     8.247E-4
How do I translate these results into a final regression equation? Does the value at the end of each term name (e.g. the "1.1" in "polym(df$Indy, df$HI, degree = 3, raw = TRUE)1.1" from the last row) signify the exponent?

Here is a simple example with a predefined function:
x1 <- runif(20, 1, 20)
x2 <- runif(20, 15, 30)
# define a function for y
y <- 1 - 3*x1 + 1/5*x2 - x1*x2 + 0.013*x1^2 + 0.2*x2^2
# add some noise to prevent a warning on the fit
y <- y + rnorm(20, 0, 0.01)
ploy_lm <- lm(y ~ polym(x1, x2, degree = 2, raw = TRUE))
summary(ploy_lm)
Call:
lm(formula = y ~ polym(x1, x2, degree = 2, raw = TRUE))
Residuals:
Min 1Q Median 3Q Max
-0.017981 -0.007537 0.001757 0.005833 0.018697
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.588e-01 7.158e-02 13.39 2.25e-09 ***
polym(x1, x2, degree = 2, raw = TRUE)1.0 -3.003e+00 2.820e-03 -1064.88 < 2e-16 ***
polym(x1, x2, degree = 2, raw = TRUE)2.0 1.315e-02 9.659e-05 136.15 < 2e-16 ***
polym(x1, x2, degree = 2, raw = TRUE)0.1 2.059e-01 6.536e-03 31.51 2.12e-14 ***
polym(x1, x2, degree = 2, raw = TRUE)1.1 -1.000e+00 1.059e-04 -9446.87 < 2e-16 ***
polym(x1, x2, degree = 2, raw = TRUE)0.2 1.998e-01 1.511e-04 1322.68 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.01167 on 14 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 6.298e+08 on 5 and 14 DF, p-value: < 2.2e-16
# In summary:
# Term       model    Fitted
# Intercept   1        0.959
# x1         -3       -3
# x1^2        0.013    0.0132
# x2          0.2      0.206
# x2^2        0.2      0.1998
# x1 * x2    -1       -1
The digit immediately after the closing ")" is the power of the first variable and the digit after the "." is the power of the second variable, so the suffix "1.1" denotes x1^1 * x2^1 (the interaction term) and "2.0" denotes x1^2.
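If you want to build the exported equation programmatically instead of transcribing the table, one possible sketch (assuming the ploy_lm fit from above, with x1 and x2 standing in for your two predictors) is to parse the "i.j" suffixes out of the coefficient names:
# Reconstruct the fitted equation from the coefficient names;
# the suffix "i.j" on each polym() term means x1^i * x2^j
cf <- coef(ploy_lm)
suffix <- sub(".*\\)", "", names(cf)[-1])                  # e.g. "1.0", "2.0", "0.1", ...
pow <- strsplit(suffix, ".", fixed = TRUE)
terms <- sapply(pow, function(p) paste0("x1^", p[1], " * x2^", p[2]))
paste("y =", signif(cf[1], 4), "+",
      paste(signif(cf[-1], 4), "*", terms, collapse = " + "))
Negative coefficients will print as "+ -", but the expression is otherwise ready to paste into another application.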

Related

Calculate by hand Fitted Values of an Interaction from a regression output

I am working with an interaction model similar to the one below:
set.seed(1993)
moderating <- sample(c("Yes", "No"),100, replace = T)
x <- sample(c("Yes", "No"), 100, replace = T)
y <- sample(1:100, 100, replace = T)
df <- data.frame(y, x, moderating)
Results <- lm(y ~ x*moderating)
summary(Results)
Call:
lm(formula = y ~ x * moderating)
Residuals:
Min 1Q Median 3Q Max
-57.857 -29.067 3.043 22.960 59.043
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 52.4000 6.1639 8.501 2.44e-13 ***
xYes 8.4571 9.1227 0.927 0.356
moderatingYes -11.4435 8.9045 -1.285 0.202
xYes:moderatingYes -0.1233 12.4563 -0.010 0.992
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 30.82 on 96 degrees of freedom
Multiple R-squared: 0.04685, Adjusted R-squared: 0.01707
F-statistic: 1.573 on 3 and 96 DF, p-value: 0.2009
I'm learning how to calculate the fitted value of an interaction from a regression table. In the example, the base (omitted) category is x = No and moderating = No.
Thus far, I know the following fitted values:
# Calculate Fitted Values From a Regression Interaction by hand
# Omitted (base) category = X_no.M_no
X_no.M_no <- 52.4000
X_yes.M_no <- 52.4000 + 8.4571
X_no.M_yes <- 52.4000 + -11.4435
X_yes.M_yes #<- ?
I do not understand how the final category, X_yes.M_yes, is calculated. My initial thought was X_yes.M_yes <- 52.4000 + -0.1233 (the intercept plus the interaction term), but that is incorrect. I know it's incorrect because, using the predict function, the fitted value of X_yes.M_yes is 49.29032, not 52.4000 + -0.1233 = 52.2767.
How do I calculate, by hand, the predicted value of the X_yes.M_yes category?
Here are the predicted values as generated from the predict function in R
#Validated Here Using the Predict Function:
newdat <- NULL
for (m in na.omit(unique(df$moderating))) {
  for (i in na.omit(unique(df$x))) {
    moderating <- m
    x <- i
    newdat <- rbind(newdat, data.frame(x, moderating))
  }
}
Prediction.1 <- cbind(newdat, predict(Results, newdat, se.fit = TRUE))
Prediction.1
In mathematical terms your regression is:
y_hat = a + b*x + c*m + d*(x*m)
where x = 1 when x is "Yes" and 0 when it is "No", and m is defined the same way from moderating.
X_yes.M_yes therefore implies x = 1 and m = 1, so your prediction is a + b + c + d,
or in your notation X_yes.M_yes = 52.4000 + 8.4571 - 11.4435 - 0.1233 = 49.2903, which matches the value you got from predict.
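A quick way to confirm this in R, using the Results object fitted above:
# all four dummies are "on", so the fitted value is the sum of all the coefficients
sum(coef(Results))
predict(Results, newdata = data.frame(x = "Yes", moderating = "Yes"))
Both should return 49.29032, the value reported by predict in the question.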

Find intersection of 1:1 line and lm line

I'm evaluating the predictive performance of a deterministic numerical model against observed data. I made a scatter plot of the observed (Vsurface) vs. modeled (Vmod) data, fit an lm (the red line), and added a 1:1 line. I want to find the point where these two lines intersect so I can document where the model shifts from over-predicting to under-predicting. Is there an easy way to do this? Here is the code for the lm:
lm <- lm(Vmod~Vsurface, data = v)
summary(lm)
Call:
lm(formula = Vmod ~ Vsurface, data = v)
Residuals:
Min 1Q Median 3Q Max
-0.63267 -0.11995 -0.03618 0.13816 0.60314
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.20666 0.06087 3.395 0.00185 **
Vsurface 0.43721 0.06415 6.816 1.05e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2232 on 32 degrees of freedom
Multiple R-squared: 0.5921, Adjusted R-squared: 0.5794
F-statistic: 46.45 on 1 and 32 DF, p-value: 1.047e-07
Here is the plot code:
ggplot(data = v, aes(x = Vsurface, y = Vmod)) +
  geom_point(col = "slateblue2") +
  geom_smooth(method = "lm", col = "red") +
  geom_abline(intercept = 0, slope = 1)
I'm working in R markdown.
Just to state G5W's comment explicitly: the fitted model is a list, and the coefficients can be extracted from it. Setting the fitted line equal to the 1:1 line, intercept + slope*x = x, gives x = intercept/(1 - slope):
lmodel <- lm(Vmod ~ Vsurface, data = v)
x1 <- lmodel$coefficients[1] / (1 - lmodel$coefficients[2])
### x1 is the intersection point (the y-coordinate is the same, since the point lies on the 1:1 line)
Edited, step by step:
x <- rnorm(100, 10, 2)
y <- rnorm(100, 15, 3)
lmodel <- lm(y ~ x)
lmodel$coefficients
(Intercept)           x
 13.6578378   0.1283835
x1 <- lmodel$coefficients[1] / (1 - lmodel$coefficients[2])
x1
15.66955
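If you also want to mark the intersection on the original plot, a sketch along these lines should work (assuming the data frame v and the plot code from the question):
library(ggplot2)
cf <- coef(lm(Vmod ~ Vsurface, data = v))
x_int <- unname(cf[1] / (1 - cf[2]))   # x where the fitted line meets the 1:1 line
ggplot(v, aes(x = Vsurface, y = Vmod)) +
  geom_point(col = "slateblue2") +
  geom_smooth(method = "lm", col = "red") +
  geom_abline(intercept = 0, slope = 1) +
  annotate("point", x = x_int, y = x_int, colour = "black", size = 3)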

R: lm() result differs when using `weights` argument and when using manually reweighted data

In order to correct for heteroskedasticity in the error terms, I am running the following weighted least squares regression in R:
#Call:
#lm(formula = a ~ q + q2 + b + c, data = mydata, weights = weighting)
#Weighted Residuals:
# Min 1Q Median 3Q Max
#-1.83779 -0.33226 0.02011 0.25135 1.48516
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) -3.939440 0.609991 -6.458 1.62e-09 ***
#q 0.175019 0.070101 2.497 0.013696 *
#q2 0.048790 0.005613 8.693 8.49e-15 ***
#b 0.473891 0.134918 3.512 0.000598 ***
#c 0.119551 0.125430 0.953 0.342167
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Residual standard error: 0.5096 on 140 degrees of freedom
#Multiple R-squared: 0.9639, Adjusted R-squared: 0.9628
#F-statistic: 933.6 on 4 and 140 DF, p-value: < 2.2e-16
Here "weighting" is a variable (a function of the variable q) used to weight the observations, and q2 is simply q^2.
Now, to double-check my results, I manually weight my variables by creating new weighted variables:
mydata$a.wls <- mydata$a * mydata$weighting
mydata$q.wls <- mydata$q * mydata$weighting
mydata$q2.wls <- mydata$q2 * mydata$weighting
mydata$b.wls <- mydata$b * mydata$weighting
mydata$c.wls <- mydata$c * mydata$weighting
I then run the following regression, without the weights option and without a constant; since the constant is weighted, the column of 1s in the original predictor matrix should now equal the variable weighting:
Call:
lm(formula = a.wls ~ 0 + weighting + q.wls + q2.wls + b.wls + c.wls,
data = mydata)
#Residuals:
# Min 1Q Median 3Q Max
#-2.38404 -0.55784 0.01922 0.49838 2.62911
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#weighting -4.125559 0.579093 -7.124 5.05e-11 ***
#q.wls 0.217722 0.081851 2.660 0.008726 **
#q2.wls 0.045664 0.006229 7.330 1.67e-11 ***
#b.wls 0.466207 0.121429 3.839 0.000186 ***
#c.wls 0.133522 0.112641 1.185 0.237876
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Residual standard error: 0.915 on 140 degrees of freedom
#Multiple R-squared: 0.9823, Adjusted R-squared: 0.9817
#F-statistic: 1556 on 5 and 140 DF, p-value: < 2.2e-16
As you can see, the results are similar but not identical. Am I doing something wrong while manually weighting the variables, or does the option "weights" do something more than simply multiplying the variables by the weighting vector?
Provided you do the manual weighting correctly, you won't see any discrepancy. The key point is that you must multiply by the square root of the weights, not the weights themselves. So the correct way to go is:
X <- model.matrix(~ q + q2 + b + c, mydata) ## non-weighted model matrix (with intercept)
w <- mydata$weighting ## weights
rw <- sqrt(w) ## root weights
y <- mydata$a ## non-weighted response
X_tilde <- rw * X ## weighted model matrix (with intercept)
y_tilde <- rw * y ## weighted response
## remember to drop intercept when using formula
fit_by_wls <- lm(y ~ X - 1, weights = w)
fit_by_ols <- lm(y_tilde ~ X_tilde - 1)
That said, it is generally recommended to use lm.fit and lm.wfit when passing in a matrix directly:
matfit_by_wls <- lm.wfit(X, y, w)
matfit_by_ols <- lm.fit(X_tilde, y_tilde)
When using these internal subroutines lm.fit and lm.wfit, however, all inputs must be complete cases with no NA values; otherwise the underlying C routine stats:::C_Cdqrls will complain.
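If your data may contain NAs, a minimal sketch is to subset everything to complete cases before calling these routines:
## drop incomplete rows before calling lm.fit / lm.wfit
ok <- complete.cases(X, y, w)
matfit_by_wls <- lm.wfit(X[ok, , drop = FALSE], y[ok], w[ok])
matfit_by_ols <- lm.fit(X_tilde[ok, , drop = FALSE], y_tilde[ok])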
If you still want to use the formula interface rather than the matrix one, you can do the following:
## weight by square root of weights, not weights
mydata$root.weighting <- sqrt(mydata$weighting)
mydata$a.wls <- mydata$a * mydata$root.weighting
mydata$q.wls <- mydata$q * mydata$root.weighting
mydata$q2.wls <- mydata$q2 * mydata$root.weighting
mydata$b.wls <- mydata$b * mydata$root.weighting
mydata$c.wls <- mydata$c * mydata$root.weighting
fit_by_wls <- lm(formula = a ~ q + q2 + b + c, data = mydata, weights = weighting)
fit_by_ols <- lm(formula = a.wls ~ 0 + root.weighting + q.wls + q2.wls + b.wls + c.wls,
data = mydata)
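You can check that the two formula-interface fits agree (only the coefficient names differ):
## should return TRUE: identical estimates up to naming
all.equal(unname(coef(fit_by_wls)), unname(coef(fit_by_ols)))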
Reproducible Example
Let's use R's built-in dataset trees (use head(trees) to inspect it; it contains no NA values). We aim to fit the model:
Height ~ Girth + Volume
with some random weights between 1 and 2:
set.seed(0); w <- runif(nrow(trees), 1, 2)
We fit this model via weighted regression, either by passing weights to lm, or by manually transforming the data and calling lm with no weights:
X <- model.matrix(~ Girth + Volume, trees) ## non-weighted model matrix (with intercept)
rw <- sqrt(w) ## root weights
y <- trees$Height ## non-weighted response
X_tilde <- rw * X ## weighted model matrix (with intercept)
y_tilde <- rw * y ## weighted response
fit_by_wls <- lm(y ~ X - 1, weights = w)
#Call:
#lm(formula = y ~ X - 1, weights = w)
#Coefficients:
#X(Intercept) XGirth XVolume
# 83.2127 -1.8639 0.5843
fit_by_ols <- lm(y_tilde ~ X_tilde - 1)
#Call:
#lm(formula = y_tilde ~ X_tilde - 1)
#Coefficients:
#X_tilde(Intercept) X_tildeGirth X_tildeVolume
# 83.2127 -1.8639 0.5843
So indeed, we see identical results.
Alternatively, we can use lm.fit and lm.wfit:
matfit_by_wls <- lm.wfit(X, y, w)
matfit_by_ols <- lm.fit(X_tilde, y_tilde)
We can check coefficients by:
matfit_by_wls$coefficients
#(Intercept) Girth Volume
# 83.2127455 -1.8639351 0.5843191
matfit_by_ols$coefficients
#(Intercept) Girth Volume
# 83.2127455 -1.8639351 0.5843191
Again, results are the same.

Need help modeling data with a log() function

I have some data that Excel will fit pretty nicely with a logarithmic trend. I want to pass the same data into R and have it tell me the coefficients and the intercept. What form should the data be in, and what function should I call to have it figure out the coefficients? Ultimately, I want to do this thousands of times so that I can project into the future.
Passing Excel these values produces this trendline function: y = -0.099ln(x) + 0.7521
Data:
y <- c(0.7521, 0.683478429, 0.643337383, 0.614856858, 0.592765647, 0.574715813,
0.559454895, 0.546235287, 0.534574767, 0.524144076, 0.514708368)
For context, the data points represent % of our user base that are retained on a given day.
The question omitted the values of x, but working backwards it seems you were using 1, 2, 3, ..., so try the following:
x <- 1:11
y <- c(0.7521, 0.683478429, 0.643337383, 0.614856858, 0.592765647,
0.574715813, 0.559454895, 0.546235287, 0.534574767, 0.524144076,
0.514708368)
fm <- lm(y ~ log(x))
giving:
> coef(fm)
(Intercept)      log(x)
     0.7521     -0.0990
and
plot(y ~ x, log = "x")
lines(fitted(fm) ~ x, col = "red")
You can get the same results by:
y <- c(0.7521, 0.683478429, 0.643337383, 0.614856858, 0.592765647, 0.574715813, 0.559454895, 0.546235287, 0.534574767, 0.524144076, 0.514708368)
t <- seq(along=y)
> summary(lm(y~log(t)))
Call:
lm(formula = y ~ log(t))
Residuals:
Min 1Q Median 3Q Max
-3.894e-10 -2.288e-10 -2.891e-11 1.620e-10 4.609e-10
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.521e-01 2.198e-10 3421942411 <2e-16 ***
log(t) -9.900e-02 1.261e-10 -784892428 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.972e-10 on 9 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 6.161e+17 on 1 and 9 DF, p-value: < 2.2e-16
For larger projects I recommend encapsulating the data in a data frame, like
df <- data.frame(y, t)
lm(formula = y ~ log(t), data = df)
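Since you mention wanting to do this thousands of times to project into the future, here is a minimal sketch (the function name and the day-30 horizon are just illustrative) that wraps the fit and extrapolates:
# fit y ~ log(t) for one retention series and return the coefficients
fit_log <- function(y) {
  t <- seq_along(y)
  coef(lm(y ~ log(t)))
}
# project retention at a future day, e.g. day 30, for the series above
cf <- fit_log(y)
unname(cf["(Intercept)"] + cf["log(t)"] * log(30))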

How to simulate data properly?

Hi, I'm new to R and would like to ask a more general question: how do I simulate or create an example data set that is suitable to be posted here and is also reproducible? I would like, for instance, to create a numeric example that abstracts my data set properly. One condition would be to implement some correlation between my dependent and independent variables.
For instance, how do I introduce some correlation between my count and my in.var1 and in.var2?
set.seed(1122)
count<-rpois(1000,30)
in.var1<- rnorm(1000, mean = 25, sd = 3)
in.var1<- rnorm(1000, mean = 12, sd = 2)
data<-cbind(count,in.var1,in.var2)
You can introduce dependence by adding some portion of the "information" from the two variables into the construction of the count variable:
set.seed(1222)
in.var1 <- rnorm(1000, mean = 25, sd = 3)
# corrected the spelling of in.var2 (the question assigned in.var1 twice)
in.var2 <- rnorm(1000, mean = 12, sd = 2)
count <- rpois(1000, 30) + 0.15*in.var1 + 0.3*in.var2
# avoid using `data` as an object name (it is a base R function)
dat <- data.frame(count, in.var1, in.var2)
> spearman(count, in.var1)
rho
0.06859676
> spearman(count, in.var2)
rho
0.1276568
> spearman(in.var1, in.var2)
rho
-0.02175273
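Note that spearman() is not a base R function (it presumably comes from a loaded package such as Hmisc); base R's cor() gives the same rank correlations:
cor(count, in.var1, method = "spearman")
cor(count, in.var2, method = "spearman")
cor(in.var1, in.var2, method = "spearman")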
> summary( glm(count ~ in.var1 + in.var2, data=dat) )
Call:
glm(formula = count ~ in.var1 + in.var2, data = dat)
Deviance Residuals:
Min 1Q Median 3Q Max
-16.6816 -3.6910 -0.4238 3.4435 15.5326
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 29.05034 1.74084 16.688 < 2e-16 ***
in.var1 0.14701 0.05613 2.619 0.00895 **
in.var2 0.35512 0.08228 4.316 1.74e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
If you want count to be a function of in.var1 and in.var2, try this. Note that count is already a function name in some commonly used packages (e.g. dplyr), so I am changing it to Count:
set.seed(1122)
in.var1 <- rnorm(1000, mean = 4, sd = 3)
in.var2 <- rnorm(1000, mean = 6, sd = 2)
Count <- rpois(1000, exp(3 + 0.5*in.var1 - 0.25*in.var2))
Data <- data.frame(Count = Count, Var1 = in.var1, Var2 = in.var2)
You now have a Poisson count based on in.var1 and in.var2. A Poisson regression will recover an intercept of about 3 and coefficients of about 0.5 for Var1 and -0.25 for Var2:
summary(glm(Count~Var1+Var2,data=Data, family=poisson))
Call:
glm(formula = Count ~ Var1 + Var2, family = poisson, data = Data)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.84702 -0.76292 -0.04463 0.67525 2.79537
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.001390 0.011782 254.7 <2e-16 ***
Var1 0.499789 0.001004 498.0 <2e-16 ***
Var2 -0.250949 0.001443 -173.9 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 308190.7 on 999 degrees of freedom
Residual deviance: 1063.3 on 997 degrees of freedom
AIC: 6319.2
Number of Fisher Scoring iterations: 4
As I understand it, you want to add some pattern to your data.
# Basic idea taken from the Data Science Exploratory Analysis course:
# http://datasciencespecialization.github.io/courses/04_ExploratoryAnalysis/
set.seed(1122)
rowNumber <- 1000
count <- rpois(rowNumber, 30)
in.var1 <- rnorm(rowNumber, mean = 25, sd = 3)
in.var2 <- rnorm(rowNumber, mean = 12, sd = 2)
data <- cbind(count, in.var1, in.var2)
dataNew <- data
for (i in 1:rowNumber) {
  # flip a coin
  coinFlip <- rbinom(1, size = 1, prob = 0.5)
  # if the coin is heads, add a common pattern to that row
  if (coinFlip) {
    dataNew[i, "count"] <- 2 * data[i, "in.var1"] + 10 * data[i, "in.var2"]
  }
}
Basically, I am adding the pattern count = 2*in.var1 + 10*in.var2 to a random subset of rows, selected by the coinFlip variable. Of course you should vectorize this for larger data; see the sketch below.
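A vectorized version of the same idea, assuming the objects created above, could look like this:
# draw all the coin flips at once and use logical indexing instead of a loop
flips <- rbinom(rowNumber, size = 1, prob = 0.5) == 1
dataNew <- data
dataNew[flips, "count"] <- 2 * data[flips, "in.var1"] + 10 * data[flips, "in.var2"]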
