I want to model a factory that takes an input of, say, x tonnes of raw material and processes it. In the first step, waste material is removed and a product P1 is created. The rest of the material is then processed again to create another product, P2.
The problem is that I want to know how much raw material it takes to produce, say, 1 tonne of P1 and how much it takes to produce 1 tonne of P2.
I know the amount of raw material and the amounts of finished product P1 and P2, but nothing more.
In my mind, this can be modelled through multivariate regression, using P1 and P2 as dependent variables and the total raw material as the independent variable, and finding a factor < 1 for each finished product. Does this seem right?
Also, how can this be achieved using R? From googling, I've found how to conduct multivariable regression, but not multivariate regression in R.
Trying to use:
datas <- read.table("datass.csv",header = TRUE, sep=",")
rawMat <- matrix(datas[,1])
P1 <- matrix(datas[,2])
P2 <- matrix(datas[,3])
fit <- lm(formula = P1 ~ rawMat)
fit2 <-lm(formula = P2 ~ rawMat)
gave me results which are certainly not aligned with reality. fit2, for instance, returned 0.1381, where I would expect a value around 0.8. How can I factor in Y1 as well? fit2 more or less gave me the average P2/rawMat, but rawMat is the same raw material used to produce both products, so I would like to have something like 0.8 as the factor for P1, and around the same for the factor of P2.
The R output was only:
(Intercept) rawMat
-65.6702 0.1381
for fit2. Why doesn't it include "rawMat1", "rawMat2" as in J.R.'s solution?
EDIT2: datass.csv contains 3 columns: the first holds the raw material required to produce both products P1 and P2, the second the tonnes of P1 produced, and the last the tonnes of P2 produced.

Multivariate multiple regression can be done with lm(). This is very well documented, but here is a small example:
rawMat <- matrix(rnorm(200), ncol=2)
noise <- matrix(rnorm(200, 0, 0.2), ncol=2)
B <- matrix( 1:4, ncol=2)
P <- t( B %*% t(rawMat)) + noise
fit <- lm(P ~ rawMat)
summary( fit )
with summary output:
Response Y1 :
Call:
lm(formula = Y1 ~ rawMat)
Residuals:
     Min       1Q   Median       3Q      Max
-0.50710 -0.14475 -0.02501  0.11955  0.51882
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.007812   0.019801  -0.395    0.694
rawMat1      1.002428   0.020141  49.770   <2e-16 ***
rawMat2      3.032761   0.020293 149.445   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1978 on 97 degrees of freedom
Multiple R-squared: 0.9964, Adjusted R-squared: 0.9963
F-statistic: 1.335e+04 on 2 and 97 DF, p-value: < 2.2e-16
Response Y2 :
Call:
lm(formula = Y2 ~ rawMat)
Residuals:
     Min       1Q   Median       3Q      Max
-0.60435 -0.11004  0.02105  0.11929  0.42539
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.02287    0.01930   1.185    0.239
rawMat1      2.05474    0.01964 104.638   <2e-16 ***
rawMat2      4.00162    0.01978 202.256   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1929 on 97 degrees of freedom
Multiple R-squared: 0.9983, Adjusted R-squared: 0.9983
F-statistic: 2.852e+04 on 2 and 97 DF, p-value: < 2.2e-16
In your case with a data.frame named datas you could do something like:
datas <- data.frame( y1 = P[,1], y2=P[,2], x1 = rawMat[,1], x2 = rawMat[,2])
fit <- lm( as.matrix(datas[ ,1:2]) ~ as.matrix(datas[,3:4]) )
or instead:
fit <- with(datas, lm( cbind(y1,y2) ~ x1+x2 ))
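For the layout described in the question (one raw-material column, two product columns), a minimal sketch might look like the one below. The data are simulated and the yield factors 0.5 and 0.3 are made-up assumptions, not values from datass.csv. With a single predictor column the coefficient table has only an intercept row and a rawMat row; the names rawMat1 and rawMat2 in the example above appear because rawMat there is a two-column matrix.

```r
set.seed(42)                      # arbitrary seed, only for repeatability
n      <- 50
rawMat <- runif(n, 100, 1000)     # simulated tonnes of raw material
P1     <- 0.5 * rawMat + rnorm(n, sd = 5)   # assumed yield factor 0.5
P2     <- 0.3 * rawMat + rnorm(n, sd = 5)   # assumed yield factor 0.3
datas  <- data.frame(rawMat, P1, P2)

# Multivariate fit: both responses on the left-hand side
fit_mv <- lm(cbind(P1, P2) ~ rawMat, data = datas)
coef(fit_mv)                      # 2 x 2 matrix: rows (Intercept), rawMat; columns P1, P2

# The same coefficients come out of two separate univariate fits
fit1 <- lm(P1 ~ rawMat, data = datas)
fit2 <- lm(P2 ~ rawMat, data = datas)
all.equal(unname(coef(fit_mv)[, "P1"]), unname(coef(fit1)))
all.equal(unname(coef(fit_mv)[, "P2"]), unname(coef(fit2)))
```

Because lm() estimates each response column separately, switching from two lm() calls to one multivariate call should not by itself change a slope such as 0.1381; if that value looks wrong, the model or the data layout is the more likely place to look.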


Creating a Line of Best Fit in R

Hi, I was wondering if anyone could help me with how to use a function in R, as I am totally new to computer programming. Apologies if this seems like a silly question to this audience.
I am currently trying to add a line of best fit to my scatter graph, but I don't quite understand how to do this. I've tried functions like "abline" and "lm", but I'm not sure whether I am using the right ideas or whether I am putting incorrect numbers into the functions.
I have cleared the workspace and just left my graph and previous workings so that it looks neater.
thanks in advance for the help.
reproducible data example:
x <- rnorm(100)
y <- 2*(x+rnorm(100))
then this makes a linear model for the intercept and slope:
lmodel <- lm(y ~ x)
which now contains the intercept (Intercept term) and the slope (as the coefficient of the x variable in the model). Here is a summary of the model:
Call:
lm(formula = y ~ x)
Residuals:
    Min      1Q  Median      3Q     Max
-3.8412 -1.0767 -0.1808  1.2216  4.1540
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.0454     0.1851   0.245    0.807
x             2.1087     0.1814  11.627   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.848 on 98 degrees of freedom
Multiple R-squared: 0.5797, Adjusted R-squared: 0.5754
F-statistic: 135.2 on 1 and 98 DF, p-value: < 2.2e-16
then make the plot using the coef() function to pull out the intercept and slope from the linear model:
plot(x,y) # plots the points
abline(a=coef(lmodel)[1], b=coef(lmodel)[2]) # plots the line, a=intercept, b=slope
Personally, I prefer ggplot2 for such things, which would look like this:
library(ggplot2)
ggplot() + geom_point(aes(x, y)) +
  geom_smooth(aes(x, y), method = "lm", se = FALSE)
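As a small aside to the base-graphics version, abline() can also take the fitted model itself and read the intercept and slope from it, which saves the coef() indexing. This is only a sketch on the same kind of simulated x and y, with an arbitrary seed added so that it is repeatable:

```r
set.seed(1)                   # arbitrary seed, assumed only for repeatability
x <- rnorm(100)
y <- 2 * (x + rnorm(100))
lmodel <- lm(y ~ x)

plot(x, y)                    # scatter plot of the points
abline(lmodel, col = "red")   # intercept and slope are taken from coef(lmodel)
```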

Quirk in producing fitted values using LM model in R

A colleague and I noticed this interesting quirk in the lm function in R.
Say I am regressing a y variable on an x variable, where x is a factor with two categories (0/1).
When I run the regression and examine the fitted values, there should be two distinct fitted values: one for the intercept (x = 0) and one for x = 1.
Instead, there are more than two: three nearly identical fitted values for the intercept and one fitted value for x = 1.
Among those that differ, the difference occurs at the last decimal place.
What might be occurring within R that produces this quirk? Why are the intercept's fitted values nearly identical but not perfectly identical?
x <- sample(c(0,1), 100, replace = T, prob = c(.5,.5))
y <- runif(100, min = 1, max = 100)
df <- data.frame(x, y)
OLS <- lm(y ~ as.factor(x), data = df)
Call:
lm(formula = y ~ as.factor(x), data = df)
Residuals:
    Min      1Q  Median      3Q     Max
-52.374 -25.163   1.776  25.521  46.571
Coefficients:
              Estimate Std. Error t value            Pr(>|t|)
(Intercept)     54.503      4.176   13.05 <0.0000000000000002 ***
as.factor(x)1   -5.117      5.683   -0.90                0.37
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 28.33 on 98 degrees of freedom
Multiple R-squared: 0.008205, Adjusted R-squared: -0.001916
F-statistic: 0.8107 on 1 and 98 DF, p-value: 0.3701
49.385426930928 54.5027935733593 54.5027935733594 54.5027935733595
54 32 13 1
My hunch is that this is the product of numerical error, as outlined in the first circle of Burns's (2011) The R Inferno?
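A quick way to probe that hunch is to round the fitted values before counting the distinct ones. The sketch below reruns the same kind of simulation with an arbitrary seed (so the numbers will not match the output above) and also checks that the fitted values agree with the two group means up to floating-point tolerance:

```r
set.seed(1)                        # arbitrary seed, not the one behind the output above
x <- sample(c(0, 1), 100, replace = TRUE, prob = c(.5, .5))
y <- runif(100, min = 1, max = 100)
OLS <- lm(y ~ as.factor(x))

fv <- fitted(OLS)
length(unique(fv))                 # may be larger than 2 because of tiny numerical differences
length(unique(round(fv, 8)))       # distinct values after rounding to 8 decimals

# Fitted values equal the group means of y up to numerical tolerance
all.equal(unname(tapply(fv, x, mean)), unname(tapply(y, x, mean)))
```

If the rounded count is 2 while the unrounded count is larger, the extra values differ only by floating-point noise from the least-squares computation, which would be consistent with the hunch.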

How R graphs regressions with a factor IV

I usually use SAS but I am trying to use R more. I am trying to show how categorizing a continuous independent variable messes up regressions. So I created some data:
set.seed(1234) #sets a seed. It is good to use the same seed all the time.
x <- rnorm(100) #x is now normally distributed with mean 0 and sd 1, N = 100
y <- 3*x + rnorm(100,0,10) #Y is related to x, but with some noise
x2 <- cut(x, 2) #Cuts x into 2 parts
then I ran a regression on x2:
m2 <- lm(y~as.factor(x2)) #A model with the cut variable
and the summary was what I expected: A coefficient for the intercept and one for the dummy variable:
Call:
lm(formula = y ~ as.factor(x2))
Residuals:
     Min       1Q   Median       3Q      Max
-30.4646  -6.5614   0.4409   5.4936  29.6696
Coefficients:
                          Estimate Std. Error t value Pr(>|t|)
(Intercept)                 -1.403      1.290  -1.088   0.2795
as.factor(x2)(0.102,2.55]    4.075      2.245   1.815   0.0726 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 10.56 on 98 degrees of freedom
Multiple R-squared: 0.03253, Adjusted R-squared: 0.02265
F-statistic: 3.295 on 1 and 98 DF, p-value: 0.07257
But when I graphed x vs. y and added a line for the regression from m2, the line was smooth - I would have expected a jump where x2 goes from 0 to 1.
abline(reg = m2)
What am I doing wrong? Or am I missing something basic?
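One possible explanation, offered as a sketch rather than a definitive answer: abline(reg = m2) only reads the first two coefficients of the model and draws a straight line with them as intercept and slope along the x axis, so it has no way of knowing that the second coefficient belongs to a dummy variable. Plotting the fitted values against the sorted x instead shows the expected jump. The code reuses the data from the question; drawing the step with type = "s" is just one way to do it.

```r
set.seed(1234)
x  <- rnorm(100)
y  <- 3 * x + rnorm(100, 0, 10)
x2 <- cut(x, 2)                    # already a factor, so as.factor() is not needed
m2 <- lm(y ~ x2)

o <- order(x)                      # sort by x so the fitted values can be drawn left to right
plot(x, y)
abline(reg = m2, lty = 2)          # straight line: coefficients read as intercept and slope
lines(x[o], fitted(m2)[o], type = "s", lwd = 2)  # step: one fitted level per bin
```

The dashed line is what abline() draws; the solid step is what the model actually predicts, one constant level on each side of the cut point.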

R: Translate the results from lm() to an equation

I'm using R and I want to translate the results from lm() to an equation.
My model is:
Residuals:
      Min        1Q    Median        3Q       Max
-0.048110 -0.023948 -0.000376  0.024511  0.044190
Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)          3.17691    0.00909  349.50  < 2e-16 ***
poly(QPB2_REF1, 2)1  0.64947    0.03015   21.54 2.66e-14 ***
poly(QPB2_REF1, 2)2  0.10824    0.03015    3.59  0.00209 **
B2DBSA_REF1DONSON   -0.20959    0.01286  -16.30 3.17e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.03015 on 18 degrees of freedom
Multiple R-squared: 0.9763, Adjusted R-squared: 0.9724
F-statistic: 247.6 on 3 and 18 DF, p-value: 8.098e-15
Do you have any idea?
I tried to have something like
f <- function(x) {3.17691 + 0.64947*x +0.10824*x^2 -0.20959*1 + 0.03015^2}
but when I tried to plug in an x, the f(x) value was incorrect.
Your output indicates that the model uses the poly function, which by default orthogonalizes the polynomials (this includes centering the x's, among other things). Your formula does no orthogonalization, and that is the likely difference. You can refit the model using raw=TRUE in the call to poly to get the raw coefficients that can be multiplied by $x$ and $x^2$.
You may also be interested in the Function function in the rms package which automates creating functions from fitted models.
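Here is an example:
library(rms)   # provides ols(), pol() and Function()
xx <- 1:25
yy <- 5 - 1.5*xx + 0.1*xx^2 + rnorm(25)
plot(xx, yy)   # curve(..., add=TRUE) needs an existing plot
fit <- ols(yy ~ pol(xx,2))
mypred <- Function(fit)
curve(mypred, add=TRUE)
mypred( c(1,25, 3, 3.5))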
You need to use the rms functions for fitting (ols and pol for this example instead of lm and poly).
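To illustrate the raw=TRUE suggestion with plain lm() and poly(), here is a minimal sketch on simulated data (the same form as the rms example, with an arbitrary seed; the coefficients are not the ones from the question). With raw polynomials the fitted coefficients can be written out as an ordinary equation, and the hand-written function agrees with predict():

```r
set.seed(1)                                   # arbitrary seed, only for repeatability
x <- 1:25
y <- 5 - 1.5 * x + 0.1 * x^2 + rnorm(25)

fit_raw <- lm(y ~ poly(x, 2, raw = TRUE))     # raw, non-orthogonal polynomial terms
b <- unname(coef(fit_raw))                    # b[1] intercept, b[2] for x, b[3] for x^2

f <- function(x) b[1] + b[2] * x + b[3] * x^2 # the model written as an equation

newx <- c(1, 3, 3.5, 25)
all.equal(f(newx), unname(predict(fit_raw, data.frame(x = newx))))
```

Note that the equation for the fitted mean contains only the coefficients; the residual standard error is not a term in it. A factor term such as the one in the question would add its coefficient as a constant for observations at that level.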
If you want to calculate y-hat based on the model, you can just use predict!
my_dat <- data.frame(x=1:10, e=rnorm(10))
my_dat$y <- with(my_dat, x*2 + e)
my_lm <- lm(y~x, data=my_dat)
Call:
lm(formula = y ~ x, data = my_dat)
Residuals:
    Min      1Q  Median      3Q     Max
-1.1348 -0.5624 -0.1393  0.3854  1.6814
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.5255     0.6673   0.787    0.454
x             1.9180     0.1075  17.835    1e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9768 on 8 degrees of freedom
Multiple R-squared: 0.9755, Adjusted R-squared: 0.9724
F-statistic: 318.1 on 1 and 8 DF, p-value: 1e-07
Now, instead of making a function like 0.5255 + x * 1.9180 manually, I just call predict for my_lm:
predict(my_lm, data.frame(x=11:20))
Same result as this (not counting minor errors from rounding the slope/intercept estimates):
0.5255 + (11:20) * 1.9180
If you are looking to actually visualize or write out a complex equation (e.g. something that has restricted cubic spline transformations), I recommend using the rms package, fitting your model, and using the latex function to see it in LaTeX:
my_lm <- ols(y~x, data=my_dat)
Note that you will need to render the LaTeX code in order to see your equation. There are websites and, if you are using a Mac, the MacTeX software, that will render it for you.

Polynomial Regression nonsense Predictions

Suppose I want to fit a linear regression model with a degree-two (orthogonal) polynomial and then predict the response. The first model (m1) uses poly(x, 2) directly.
Now let's try the same model, but instead of using $poly(x,2)$ as a whole, the second model (m2) uses its columns individually.
Let's look at the summaries of m1 and m2.
> summary(m1)
Call:
lm(formula = y ~ poly(x, 2))
Residuals:
     Min       1Q   Median       3Q      Max
-2.50347 -0.48752 -0.07085  0.53624  2.96516
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.677e+04  9.912e-02 -169168   <2e-16 ***
poly(x, 2)1 -1.449e+05  9.912e-01 -146195   <2e-16 ***
poly(x, 2)2 -3.726e+04  9.912e-01  -37588   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9912 on 97 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 1.139e+10 on 2 and 97 DF, p-value: < 2.2e-16
> summary(m2)
Call:
lm(formula = y ~ poly(x, 2)[, 1] + poly(x, 2)[, 2])
Residuals:
     Min       1Q   Median       3Q      Max
-2.50347 -0.48752 -0.07085  0.53624  2.96516
Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)     -1.677e+04  9.912e-02 -169168   <2e-16 ***
poly(x, 2)[, 1] -1.449e+05  9.912e-01 -146195   <2e-16 ***
poly(x, 2)[, 2] -3.726e+04  9.912e-01  -37588   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9912 on 97 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 1.139e+10 on 2 and 97 DF, p-value: < 2.2e-16
So m1 and m2 are basically the same. Now let's look at the predictions prd.1 and prd.2
> prd.1
1 2 3 4 5 6
-54811.60 -55863.58 -56925.56 -57997.54 -59079.52 -60171.50
> prd.2
1 2 3 4 5 6
49505.92 39256.72 16812.28 -17827.42 -64662.35 -123692.53
Q1: Why is prd.2 significantly different from prd.1?
Q2: How can I obtain prd.1 using the model m2?
m1 is the right way to do this. m2 is entering a whole world of pain...
To do predictions from m2, the model needs to know it was fitted to an orthogonal set of basis functions, so that it uses the same basis functions for the extrapolated new data values. Compare: poly(1:10,2)[,2] with poly(1:12,2)[,2] - the first ten values are not the same. If you fit the model explicitly with poly(x,2) then predict understands all that and does the right thing.
What you have to do is make sure your prediction locations are transformed using the same set of basis functions that was used to create the model in the first place. You can use predict.poly for this (note that I call my explanatory variables x1 and x2 so that it's easy to match the names up):
px = poly(x,2)
x1 = px[,1]
x2 = px[,2]
m3 = lm(y~x1+x2)
newx = 90:110
pnew = predict(px,newx) # px is the previous poly object, so this calls predict.poly
prd.3 = predict(m3, newdata=data.frame(x1=pnew[,1],x2=pnew[,2]))
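Since the original data are not shown, here is a self-contained sketch on simulated data (the quadratic and the seed are arbitrary assumptions) that puts the pieces above together. It first confirms that the orthogonal basis depends on the data it was built from, and then checks that the predict.poly route gives the same predictions as the poly(x, 2) model:

```r
set.seed(1)                                  # arbitrary seed; simulated data, not the asker's
x <- 1:100
y <- 2 - 3 * x - 0.5 * x^2 + rnorm(100)

# The basis changes with the data: the first ten values do not match
isTRUE(all.equal(poly(1:10, 2)[, 2], poly(1:12, 2)[1:10, 2]))  # FALSE

# Model fitted with poly() directly; predict() reuses the stored basis
m1    <- lm(y ~ poly(x, 2))
newx  <- 90:110
prd.1 <- predict(m1, newdata = data.frame(x = newx))

# Model fitted on the extracted columns; new x must be transformed by hand
px <- poly(x, 2)
x1 <- px[, 1]
x2 <- px[, 2]
m3 <- lm(y ~ x1 + x2)
pnew  <- predict(px, newx)                   # dispatches to predict.poly
prd.3 <- predict(m3, newdata = data.frame(x1 = pnew[, 1], x2 = pnew[, 2]))

all.equal(unname(prd.1), unname(prd.3))
```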
