How R graphs regressions with a factor IV

I usually use SAS, but I am trying to use R more. I want to show how categorizing a continuous independent variable distorts a regression. So I created some data:
set.seed(1234) #set a seed for reproducibility
x <- rnorm(100) #x is normally distributed with mean 0 and sd 1, n = 100
y <- 3*x + rnorm(100,0,10) #y is related to x, but with some noise
x2 <- cut(x, 2) #cuts x into 2 intervals
Then I ran a regression on x2:
m2 <- lm(y~as.factor(x2)) #A model with the cut variable
summary(m2)
and the summary was what I expected: a coefficient for the intercept and one for the dummy variable:
Call:
lm(formula = y ~ as.factor(x2))

Residuals:
     Min       1Q   Median       3Q      Max 
-30.4646  -6.5614   0.4409   5.4936  29.6696 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)  
(Intercept)                 -1.403      1.290  -1.088   0.2795  
as.factor(x2)(0.102,2.55]    4.075      2.245   1.815   0.0726 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 10.56 on 98 degrees of freedom
Multiple R-squared:  0.03253,	Adjusted R-squared:  0.02265 
F-statistic: 3.295 on 1 and 98 DF,  p-value: 0.07257
But when I graphed x vs. y and added a line for the regression from m2, the line was a single straight line. I would have expected a jump where x2 changes from its first level to its second.
plot(x,y)
abline(reg = m2)
What am I doing wrong? Or am I missing something basic?
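For what it's worth, a minimal sketch of what may be going on (assuming the setup above): abline() only ever draws one straight line, taking the first two coefficients of the model as intercept and slope, so the factor coefficient is silently misused. Plotting the fitted values shows the step fit instead:

```r
# Sketch: abline(reg = m2) treats coef(m2)[1:2] as intercept and slope,
# so it always draws a single straight line, even for a factor model.
# Plotting the fitted values reveals the actual two-level step fit.
set.seed(1234)
x  <- rnorm(100)
y  <- 3 * x + rnorm(100, 0, 10)
x2 <- cut(x, 2)
m2 <- lm(y ~ x2)              # cut() already returns a factor
plot(x, y)
points(x, fitted(m2), col = "red", pch = 16)  # two horizontal levels
```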

Related

Quirk in producing fitted values using lm model in R

A colleague and I noticed this interesting quirk in the lm function in R.
Say I am regressing a y variable on an x variable, where x is a factor with two levels (0/1).
When I run the regression and examine the fitted values, there should be two distinct fitted values: one for x = 0 (the intercept) and one for x = 1.
Instead, there are more than two: three nearly identical fitted values for the intercept group and one fitted value for x = 1.
Among those that differ, the difference occurs at the last decimal place.
What might be occurring within R that produces this quirk? Why are the intercept's fitted values nearly identical but not exactly identical?
set.seed(1995)
x <- sample(c(0,1), 100, replace = T, prob = c(.5,.5))
y <- runif(100, min = 1, max = 100)
df <- data.frame(x, y)
OLS <- lm(y ~ as.factor(x), data = df)
summary(OLS)
Call:
lm(formula = y ~ as.factor(x), data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-52.374 -25.163   1.776  25.521  46.571 

Coefficients:
              Estimate Std. Error t value            Pr(>|t|)    
(Intercept)     54.503      4.176   13.05 <0.0000000000000002 ***
as.factor(x)1   -5.117      5.683   -0.90                0.37    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 28.33 on 98 degrees of freedom
Multiple R-squared:  0.008205,	Adjusted R-squared:  -0.001916 
F-statistic: 0.8107 on 1 and 98 DF,  p-value: 0.3701
table(OLS$fitted.values)
 49.385426930928 54.5027935733593 54.5027935733594 54.5027935733595 
              54               32               13                1 
My hunch is that this is a product of floating-point error, as outlined in the first circle of Burns's (2011) The R Inferno.
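A minimal sketch consistent with that hunch (reusing the code above): the fitted values come from floating-point matrix arithmetic, so values within the same group can differ in their last few bits, and rounding collapses them back to the two expected values:

```r
# Sketch: lm() computes fitted values as X %*% beta in floating point,
# so observations in the same factor group can differ in the final bits.
set.seed(1995)
x <- sample(c(0, 1), 100, replace = TRUE, prob = c(.5, .5))
y <- runif(100, min = 1, max = 100)
OLS <- lm(y ~ as.factor(x))
table(round(OLS$fitted.values, 8))  # collapses to two fitted values
```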

R: Same values for confidence and prediction intervals using predict()

When I try to compute prediction and confidence intervals around a linear regression model with two continuous variables and two categorical variables (which act as dummy variables), the results for the two intervals are exactly the same. I'm using the predict() function.
I've already tried other datasets that have continuous and discrete variables but no categorical or dichotomous ones, and there the intervals differ. I tried removing some variables from the regression model, and the intervals are still the same. I've also compared my data.frame with the examples in the R documentation, and I don't think the problem is there.
#linear regression model: modeloReducido
summary(modeloReducido)

Call:
lm(formula = V ~ T * W + P * G, data = Datos)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.5579 -1.6222  0.3286  1.6175 10.4773 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.937674   3.710133   0.253 0.800922    
T           -12.864441   2.955519  -4.353 2.91e-05 ***
W             0.013926   0.001432   9.722  < 2e-16 ***
P            12.142109   1.431102   8.484 8.14e-14 ***
GBaja        15.953421   4.513963   3.534 0.000588 ***
GMedia        0.597568   4.546935   0.131 0.895669    
T:W           0.014283   0.001994   7.162 7.82e-11 ***
P:GBaja      -3.249681   2.194803  -1.481 0.141418    
P:GMedia     -5.093860   2.147673  -2.372 0.019348 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.237 on 116 degrees of freedom
Multiple R-squared:  0.9354,	Adjusted R-squared:  0.931 
F-statistic: 210 on 8 and 116 DF,  p-value: < 2.2e-16
#Prediction Interval
newdata1.2 <- data.frame(T=1,W=1040,P=10000,G="Media")
#EP
opt1.PI <- predict.lm(modeloReducido, newdata1.2,
interval="prediction", level=.95)
#Confidence interval
newdata1.1 <- data.frame(T=1,W=1040,P=10000,G="Media")
#EP
opt1.CI <- predict(modeloReducido, newdata1.1,
interval="confidence", level=.95)
opt1.CI
#fit lwr upr
#1 70500.51 38260.24 102740.8
opt1.PI
# fit lwr upr
# 1 70500.51 38260.24 102740.8
opt1.PI and opt1.CI should be different.
The Excel file that I was given is at the following link:
https://www.filehosting.org/file/details/830581/Datos%20Tarea%204.xlsx
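For reference, a toy sketch (made-up data, not the questioner's Datos) of how the two intervals normally differ: the prediction interval adds the residual variance on top of the uncertainty in the mean, so it is strictly wider than the confidence interval at the same point:

```r
# Sketch with made-up data: confidence vs. prediction interval at one point.
set.seed(1)
d <- data.frame(x = 1:20)
d$y <- 2 * d$x + rnorm(20)
m  <- lm(y ~ x, data = d)
nd <- data.frame(x = 10)
conf_i <- predict(m, nd, interval = "confidence", level = 0.95)
pred_i <- predict(m, nd, interval = "prediction", level = 0.95)
conf_i  # narrower band around the mean response
pred_i  # same fit, wider band for a single new observation
```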

Linear regression in R: comparing multiple observations vs. single observations

Based upon the answers to my question, I am supposed to get the same intercept and regression coefficient for the two models below, but they are not the same. What is going on?
Is something wrong with my code, or is the original answer wrong?
#linear regression average qty per price point vs all quantities
x1=rnorm(30,20,1);y1=rep(3,30)
x2=rnorm(30,17,1.5);y2=rep(4,30)
x3=rnorm(30,12,2);y3=rep(4.5,30)
x4=rnorm(30,6,3);y4=rep(5.5,30)
x=c(x1,x2,x3,x4)
y=c(y1,y2,y3,y4)
plot(y,x)
cor(y,x)
fit=lm(x~y)
attributes(fit)
summary(fit)
xdum=c(20,17,12,6)
ydum=c(3,4,4.5,5.5)
plot(ydum,xdum)
cor(ydum,xdum)
fit1=lm(xdum~ydum)
attributes(fit1)
summary(fit1)
> summary(fit)

Call:
lm(formula = x ~ y)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.3572 -1.6069 -0.1007  2.0222  6.4904 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  40.0952     1.1570   34.65   <2e-16 ***
y            -6.1932     0.2663  -23.25   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.63 on 118 degrees of freedom
Multiple R-squared:  0.8209,	Adjusted R-squared:  0.8194 
F-statistic: 540.8 on 1 and 118 DF,  p-value: < 2.2e-16
> summary(fit1)

Call:
lm(formula = xdum ~ ydum)

Residuals:
      1       2       3       4 
-0.9615  1.8077 -0.3077 -0.5385 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  38.2692     3.6456  10.497  0.00895 **
ydum         -5.7692     0.8391  -6.875  0.02051 * 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.513 on 2 degrees of freedom
Multiple R-squared:  0.9594,	Adjusted R-squared:  0.9391 
F-statistic: 47.27 on 1 and 2 DF,  p-value: 0.02051
You are not calculating xdum and ydum in a comparable fashion, because rnorm will only approximate the mean value you specify, particularly when you are sampling only 30 cases. This is easily fixed, however:
coef(fit)
#(Intercept) y
# 39.618472 -6.128739
xdum <- c(mean(x1),mean(x2),mean(x3),mean(x4))
ydum <- c(mean(y1),mean(y2),mean(y3),mean(y4))
coef(lm(xdum~ydum))
#(Intercept) ydum
# 39.618472 -6.128739
In theory the two fits are the same if (and only if) the group means in the former model equal the single points used in the latter model.
That is not the case in your models, so the results are slightly different. For example, the mean of x1:
x1 = rnorm(30, 20, 1)
mean(x1)
[1] 20.08353
where the point version is 20.
There are similar tiny differences from your other rnorm samples:
> mean(x2)
[1] 17.0451
> mean(x3)
[1] 11.72307
> mean(x4)
[1] 5.913274
Not that it really matters, but FYI: the standard nomenclature is that Y is the dependent variable and X is the independent variable, which you reversed. It makes no difference to the fit, of course.

R: Translate the results from lm() to an equation

I'm using R and I want to translate the results from lm() to an equation.
My model is:
Residuals:
      Min        1Q    Median        3Q       Max 
-0.048110 -0.023948 -0.000376  0.024511  0.044190 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)          3.17691    0.00909  349.50  < 2e-16 ***
poly(QPB2_REF1, 2)1  0.64947    0.03015   21.54 2.66e-14 ***
poly(QPB2_REF1, 2)2  0.10824    0.03015    3.59  0.00209 ** 
B2DBSA_REF1DONSON   -0.20959    0.01286  -16.30 3.17e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.03015 on 18 degrees of freedom
Multiple R-squared:  0.9763,	Adjusted R-squared:  0.9724 
F-statistic: 247.6 on 3 and 18 DF,  p-value: 8.098e-15
Do you have any idea?
I tried to have something like
f <- function(x) {3.17691 + 0.64947*x +0.10824*x^2 -0.20959*1 + 0.03015^2}
but when I plug in a value of x, the value of f(x) is incorrect.
Your output indicates that the model uses the poly function, which by default orthogonalizes the polynomials (this includes centering the x's, among other things). Your formula does no orthogonalization, and that is the likely difference. You can refit the model using raw = TRUE in the call to poly to get raw coefficients that can be multiplied by $x$ and $x^2$.
You may also be interested in the Function function in the rms package, which automates creating functions from fitted models.
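A small sketch of the raw = TRUE route (made-up data, since the original QPB2_REF1 data is not shown): with raw coefficients, the fitted equation really is b0 + b1*x + b2*x^2:

```r
# Sketch: poly(x, 2, raw = TRUE) gives plain x and x^2 columns, so the
# printed coefficients plug directly into a polynomial in x.
set.seed(42)
x <- runif(30, 0, 5)
y <- 1 + 2 * x - 0.5 * x^2 + rnorm(30, 0, 0.1)
fit_raw <- lm(y ~ poly(x, 2, raw = TRUE))
b <- unname(coef(fit_raw))
f <- function(x) b[1] + b[2] * x + b[3] * x^2
all.equal(f(x), unname(fitted(fit_raw)))  # TRUE
```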
Edit
Here is an example:
library(rms)
xx <- 1:25
yy <- 5 - 1.5*xx + 0.1*xx^2 + rnorm(25)
plot(xx,yy)
fit <- ols(yy ~ pol(xx,2))
mypred <- Function(fit)
curve(mypred, add=TRUE)
mypred( c(1,25, 3, 3.5))
You need to use the rms fitting functions (ols and pol in this example, instead of lm and poly).
If you want to calculate y-hat based on the model, you can just use predict!
Example:
set.seed(123)
my_dat <- data.frame(x=1:10, e=rnorm(10))
my_dat$y <- with(my_dat, x*2 + e)
my_lm <- lm(y~x, data=my_dat)
summary(my_lm)
Result:
Call:
lm(formula = y ~ x, data = my_dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.1348 -0.5624 -0.1393  0.3854  1.6814 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.5255     0.6673   0.787    0.454    
x             1.9180     0.1075  17.835    1e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9768 on 8 degrees of freedom
Multiple R-squared:  0.9755,	Adjusted R-squared:  0.9724 
F-statistic: 318.1 on 1 and 8 DF,  p-value: 1e-07
Now, instead of making a function like 0.5255 + x * 1.9180 manually, I just call predict for my_lm:
predict(my_lm, data.frame(x=11:20))
Same result as this (not counting minor errors from rounding the slope/intercept estimates):
0.5255 + (11:20) * 1.9180
If you want to actually visualize or write out a complex equation (e.g. something with restricted cubic spline transformations), I recommend using the rms package: fit your model and use the latex function to see it in LaTeX:
my_lm <- ols(y~x, data=my_dat)
latex(my_lm)
Note that you will need to render the LaTeX code to see your equation. There are websites and, on a Mac, MacTeX software that will render it for you.

Linear multivariate regression in R

I want to model a factory that takes an input of, say, x tonnes of raw material, which is then processed. In the first step, waste material is removed and a product P1 is created. The rest of the material is processed once again, and another product P2 is created.
The problem is that I want to know how much raw material it takes to produce, say, 1 tonne of product P1 and how much raw material it takes to produce 1 tonne of P2.
I know the amount of raw materials, the amount of finished product P1 and P2 but nothing more.
In my mind, this can be modelled through multivariate regression, using P1 and P2 as dependent variables and the total raw material as the independent variable, finding factors < 1 for each finished product. Does this seem right?
Also, how can this be achieved using R? From googling, I've found how to conduct multivariable regression, but not multivariate regression in R.
EDIT:
Trying to use:
datas <- read.table("datass.csv",header = TRUE, sep=",")
rawMat <- matrix(datas[,1])
P1 <- matrix(datas[,2])
P2 <- matrix(datas[,3])
fit <- lm(formula = P1 ~ rawMat)
fit
fit2 <-lm(formula = P2 ~ rawMat)
fit2
gave me results that are certainly not aligned with reality. fit2, for instance, returned 0.1381 where the value should be around 0.8. How can I factor in Y1 as well? fit2 more or less gave me the average P2/rawMat, but the raw material is the same raw material used to produce both products, so I would like to get something like 0.8 as the factor for P1, and around the same for the factor of P2.
The R output was only:
Coefficients:
(Intercept) rawMat
-65.6702 0.1381
for fit2. Why doesn't it include "rawMat1", "rawMat2" as in J.R.'s solution?
EDIT2: datass.csv contains 3 columns: the first is the raw material required to produce both products P1 and P2, the second is the tonnes of P1 produced, and the last is the same for P2.
Multivariate multiple regression can be done with lm(). This is very well documented, but here follows a little example:
rawMat <- matrix(rnorm(200), ncol=2)
noise <- matrix(rnorm(200, 0, 0.2), ncol=2)
B <- matrix( 1:4, ncol=2)
P <- t( B %*% t(rawMat)) + noise
fit <- lm(P ~ rawMat)
summary( fit )
with summary output:
Response Y1 :

Call:
lm(formula = Y1 ~ rawMat)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.50710 -0.14475 -0.02501  0.11955  0.51882 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.007812   0.019801  -0.395    0.694    
rawMat1      1.002428   0.020141  49.770   <2e-16 ***
rawMat2      3.032761   0.020293 149.445   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1978 on 97 degrees of freedom
Multiple R-squared:  0.9964,	Adjusted R-squared:  0.9963 
F-statistic: 1.335e+04 on 2 and 97 DF,  p-value: < 2.2e-16

Response Y2 :

Call:
lm(formula = Y2 ~ rawMat)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.60435 -0.11004  0.02105  0.11929  0.42539 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.02287    0.01930   1.185    0.239    
rawMat1      2.05474    0.01964 104.638   <2e-16 ***
rawMat2      4.00162    0.01978 202.256   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1929 on 97 degrees of freedom
Multiple R-squared:  0.9983,	Adjusted R-squared:  0.9983 
F-statistic: 2.852e+04 on 2 and 97 DF,  p-value: < 2.2e-16
EDIT!:
In your case with a data.frame named datas you could do something like:
datas <- data.frame( y1 = P[,1], y2=P[,2], x1 = rawMat[,1], x2 = rawMat[,2])
fit <- lm( as.matrix(datas[ ,1:2]) ~ as.matrix(datas[,3:4]) )
or instead:
fit <- with(datas, lm( cbind(y1,y2) ~ x1+x2 ))
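Applied to the asker's single-predictor setup, the same cbind() trick fits both products against the one raw-material column (the data here is made up; the 0.80 and 0.15 yield factors are hypothetical, for illustration only):

```r
# Sketch: both responses regressed on the same rawMat column at once.
set.seed(7)
rawMat <- runif(50, 100, 200)
P1 <- 0.80 * rawMat + rnorm(50, 0, 2)  # hypothetical yield for P1
P2 <- 0.15 * rawMat + rnorm(50, 0, 2)  # hypothetical yield for P2
fit <- lm(cbind(P1, P2) ~ rawMat)
coef(fit)  # one intercept/slope column per response
```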
