How to create a linear regression with R?

I have a simple matrix like:
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
[4,]   10   11   12
I have to calculate a linear regression over these columns, something like lm(x ~ y), where the first column is the X and the others are the Y. I mean, can I use the other columns together as one variable (y),
or
do I have to write something like lm(x ~ y + z + c + b), etc.?
Thank you

Yes, but I wouldn't really recommend it:
> set.seed(2)
> mat <- matrix(runif(12), ncol = 3, byrow = TRUE)
> mat
          [,1]      [,2]      [,3]
[1,] 0.1848823 0.7023740 0.5733263
[2,] 0.1680519 0.9438393 0.9434750
[3,] 0.1291590 0.8334488 0.4680185
[4,] 0.5499837 0.5526741 0.2388948
> mod <- lm(mat[,1] ~ mat[,-1])
> mod
Call:
lm(formula = mat[, 1] ~ mat[, -1])

Coefficients:
(Intercept)   mat[, -1]1   mat[, -1]2
     1.0578      -1.1413       0.1177
Why is this not recommended? You are abusing the formula interface here: it works, but the model coefficients get odd names, and you incur the overhead of the formula machinery, which is designed for extracting the response and covariates from a data frame or list referenced in the symbolic formula.
The usual way of working is:
df <- data.frame(mat)
names(df) <- c("Y","A","B")
## specify all terms:
lm(Y ~ A + B, data = df)
## or use the `.` shortcut
lm(Y ~ ., data = df)
If you don't want to go via the data frame, then you can call the workhorse function behind lm(), lm.fit(), directly with a simple manipulation:
lm.fit(cbind(rep(1, nrow(mat)), mat[,-1]), mat[, 1])
Here we bind a column of 1s onto columns 2 and 3 of mat (cbind(rep(1, nrow(mat)), mat[,-1])); this is the model matrix. mat[, 1] is the response. Whilst lm.fit() doesn't return an "lm"-classed object, it is very quick and can relatively easily be converted to one if that matters.
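For example, here is a quick check (a sketch; fit2 is just an illustrative name) that the two routes agree up to coefficient names:
fit2 <- lm.fit(cbind(rep(1, nrow(mat)), mat[, -1]), mat[, 1])
## the estimates should match coef(mod) from the formula-based fit above
cbind(lm.fit = fit2$coefficients, formula = coef(mod))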
By the way, you have the usual notation back to front. Y is usually the response, with X indicating the covariates used to model or predict Y.

Related

Regression with multivariate vectors and coefficients extraction

I want to create 1000 samples of 200 bivariate normally distributed vectors
library(MASS)
set.seed(42) # for the sake of reproducibility
mu <- c(1, 1)
S <- matrix(c(0.56, 0.4,
              0.4,  1), nrow=2, ncol=2, byrow=TRUE)
bivn <- mvrnorm(200, mu=mu, Sigma=S)
so that I can run OLS regressions on each sample and thereby get 1000 estimates. I tried this
bivn_1000 <- replicate(1000, mvrnorm(200, mu=mu, Sigma=S), simplify=FALSE)
but I am stuck there, because now I don't know how to proceed with running the regression for each sample.
I would appreciate the help to know how to run these 1000 regressions and then extract the coefficients.
We could write a custom regression function.
regFun1 <- function(x) summary(lm(x[, 1] ~ x[, 2]))
which we can loop over the data with lapply:
l1 <- lapply(bivn_1000, regFun1)
The coefficients are saved inside a list and can be extracted like so:
l1[[1]]$coefficients # for the first regression
#              Estimate Std. Error   t value     Pr(>|t|)
# (Intercept) 0.5554601 0.06082924  9.131466 7.969277e-17
# x[, 2]      0.4797568 0.04255711 11.273246 4.322184e-23
Edit:
If we want only the estimates without the accompanying statistics, we adjust the function's output accordingly.
regFun2 <- function(x) summary(lm(x[, 1] ~ x[, 2]))$coef[, 1]
Since we may want the output in matrix form, we use sapply next.
m2 <- t(sapply(bivn_1000, regFun2))
head(m2)
#      (Intercept)    x[, 2]
# [1,]   0.6315558 0.4389721
# [2,]   0.5514555 0.4840933
# [3,]   0.6782464 0.3250800
# [4,]   0.6350999 0.3848747
# [5,]   0.5899311 0.3645237
# [6,]   0.6263678 0.3825725
where
dim(m2)
# [1] 1000 2
assures us that we have our 1,000 estimates.
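As a further sanity check (a sketch, not part of the original answer), the Monte Carlo averages should sit close to the population values implied by mu and S: the slope is Cov/Var = 0.4/1 = 0.4, and the intercept is 1 - 0.4*1 = 0.6.
## column means should be near the population intercept (0.6) and slope (0.4)
colMeans(m2)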

How can I get the contribution by each predictor to the final regression prediction in lm

Using R, when I use rlm or lm I would like to get the contribution of each predictor to the model's final prediction.
The problem occurs when I have interaction terms, as I think they are not stored in the lm object.
Below is sample data (I am looking for a way that generalizes to any number of predictors).
Sample data:
set.seed(1)
y <- rnorm(10)
m <- data.frame(v1=rnorm(10), v2=rnorm(10), v3=rnorm(10))
lmObj <- lm(formula=y~0+v1*v3+v2*v3, data=m)
betaHat <- coefficients(lmObj)
betaHat
      v1       v3       v2    v1:v3    v3:v2
 0.03455 -0.50224 -0.57745  0.58905 -0.65592
# How do I get the data.frame or matrix with columns (v1,v3,v2,v1:v3,v3:v2)
# worth [m$v1*v1, ... , (m$v3*m$v2)*v3:v2]
At first I thought that by "contribution" you meant the explained variance of each term (which an ANOVA table would give), but it seems you actually want term-wise prediction:
predict(lmObj, type = "terms")
See ?predict.lm.
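To convince yourself the term-wise predictions are consistent, here is a small check (a sketch, not from the original answer): the terms plus the attached constant reconstruct the usual fitted values.
tt <- predict(lmObj, type = "terms")
## row sums of the terms, plus the constant attribute, give the fitted values
all.equal(rowSums(tt) + attr(tt, "constant"), fitted(lmObj))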
Actually I got it from lm itself; the trick is to ask for x=TRUE:
lmObj <- lm(formula=y~0+v1*v3+v2*v3, data=m, x=TRUE)
lmObj$x %*% diag(lmObj$coefficients)
         [,1]     [,2]     [,3]     [,4]     [,5]
1   0.0522305 -0.68238 -0.53066  1.20993 -0.81898
2   0.0134687  0.05162 -0.45164 -0.02360  0.05273
3  -0.0214632 -0.19470 -0.04306 -0.14187 -0.01896
4  -0.0765156  0.02702  1.14875  0.07019 -0.07021
5   0.0388652  0.69161 -0.35792 -0.91250  0.55985
6  -0.0015524  0.20843  0.03241  0.01098 -0.01528
7  -0.0005594  0.19803  0.08996  0.00376 -0.04029
8   0.0326086  0.02979  0.84928 -0.03298 -0.05722
9   0.0283723 -0.55248  0.27611  0.53213  0.34500
10  0.0205187 -0.38330 -0.24134  0.26699 -0.20921
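One way to verify this matrix by hand (a quick sketch): since X %*% diag(beta) just spreads X %*% beta across the terms, each row sums back to the corresponding fitted value.
## each row of the term matrix sums to the fitted value for that observation
all.equal(rowSums(lmObj$x %*% diag(lmObj$coefficients)), fitted(lmObj))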

How to calculate "terms" from predict-function manually when regression has an interaction term

Does anyone know how the predict function calculates the terms when there is an interaction term in the regression model? I know how to compute the terms by hand when the regression has no interaction terms, but once I add one I can't reproduce them manually anymore. Here is some example data; I would like to see how to calculate those values by hand. Thanks! -Aleksi
set.seed(2)
a <- c(4,3,2,5,3) # first I make some data
b <- c(2,1,4,3,5)
e <- rnorm(5)
y <- 0.6*a + e
data <- data.frame(a,b,y)
model1 <- lm(y ~ a*b, data=data) # regression
predict(model1, type='terms', data) # terms
#This gives the result:
            a          b        a:b
1  0.04870807 -0.3649011  0.2049069
2 -0.03247205 -0.7298021  0.7740928
3 -0.11365216  0.3649011  0.2049069
4  0.12988818  0.0000000 -0.5919534
5 -0.03247205  0.7298021 -0.5919534
attr(,"constant")
[1] 1.973031
Your model is technically y = b0 + b1*a + b2*b + b3*a*b + e. A term is calculated by multiplying the independent variable by its coefficient and centering the result. So, for example, the terms for a would be
cf <- coef(model1)
scale(a * cf[2], scale = FALSE)
            [,1]
[1,]  0.04870807
[2,] -0.03247205
[3,] -0.11365216
[4,]  0.12988818
[5,] -0.03247205
which matches your output above.
And since the interaction term is nothing more than the product of the independent variables, this translates to
scale(a * b * cf[4], scale = FALSE)
           [,1]
[1,]  0.2049069
[2,]  0.7740928
[3,]  0.2049069
[4,] -0.5919534
[5,] -0.5919534
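For completeness (the same recipe, just not shown in the original answer), the term for b uses its own coefficient, cf[3]:
## the b term: multiply by the coefficient, then center
scale(b * cf[3], scale = FALSE)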

Generation of random variables

I have a problem with the generation of random variables in R.
I have to generate random variables $X_{ij}$ ($i = 1, \dots, 25$; $j = 1, \dots, 5$), knowing that each $X_{ij}$ follows a binomial distribution $X_{ij} \sim \mathrm{Bin}(n_{ij}, p_{ij})$, and I already know $n_{ij}$ and $p_{ij}$ for each index. How do I generate these random variables?
I don't know if it is useful, but I have generated the $p_{ij}$ knowing that they are also random variables, which follow a beta distribution (hence the $X_{ij}$ actually follow a beta-binomial).
Let's say you had the following matrices for n and p:
(n <- matrix(4:7, nrow=2))
#      [,1] [,2]
# [1,]    4    6
# [2,]    5    7
set.seed(144)
(p <- matrix(rbeta(4, 1, 2), nrow=2))
#           [,1]      [,2]
# [1,] 0.1582904 0.2794913
# [2,] 0.5176909 0.2889718
Now you can draw samples X_{ij} with something like:
set.seed(144)
matrix(apply(cbind(as.vector(n), as.vector(p)), 1,
             function(x) rbinom(1, x[1], x[2])), nrow=2)
#      [,1] [,2]
# [1,]    0    2
# [2,]    2    2
The cbind part of this expression builds a two-column matrix containing each (n, p) pairing, the apply part draws a single binomially distributed sample for each pair, and the matrix part reshapes the resulting vector back into a matrix.
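Incidentally (an alternative sketch, not part of the original answer), rbinom() is itself vectorized over size and prob, so the apply() loop can be collapsed into a single call; with the same seed this should reproduce the same draws:
set.seed(144)
## one binomial draw per (n, p) pair, taken element-wise
matrix(rbinom(length(n), size = as.vector(n), prob = as.vector(p)), nrow = 2)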

Returning a vector of attributes shared by a set of objects

I have a list of lm (linear model) objects.
How can I select a particular element (such as the intercept, rank, or residuals) from all the objects in a single call?
I use the plyr package. If my list of objects is called modelOutput and I want to get out all the predicted values, I do this:
modelPredictions <- ldply(modelOutput, function(m) as.data.frame(predict(m)))
if I want all the coefficients, I do this:
modelCoef <- ldply(modelOutput, function(m) as.data.frame(coef(m)))
Hadley originally showed me how to do this in a previous question.
First I'll generate some example data:
> set.seed(123)
> x <- 1:10
> a <- 3
> b <- 5
> fit <- list()
> for (i in 1:10) {
+   y <- a + b*x + rnorm(10, 0, .3)
+   fit[[i]] <- lm(y ~ x)
+ }
Here's one option for grabbing the estimates from each fit:
> t(sapply(fit, coef))
      (Intercept)        x
 [1,]    3.157640 4.975409
 [2,]    3.274724 4.961430
 [3,]    2.632744 5.043616
 [4,]    3.228908 4.975946
 [5,]    2.933742 5.011572
 [6,]    3.097926 4.994287
 [7,]    2.709796 5.059478
 [8,]    2.766553 5.022649
 [9,]    2.981451 5.020450
[10,]    3.238266 4.980520
As you mention, other quantities concerning the fit are available. Above I only grabbed the coefficients with the coef() function. Check out the following command for more:
names(summary(fit[[1]]))
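The same pattern pulls any of those summary components out of every fit in one call, e.g. (a sketch):
## extract the R-squared from each of the ten fits
sapply(fit, function(m) summary(m)$r.squared)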
