I need to run a simple regression using Lm() in R. Its simple because I have only one independent variable. However the catch is that I need to test this independent variable for a number of dependents which are columns in a data frame.
So basically I have one common X and numerous Y's for which i need to extract the intercept and slope and store them all in a data frame.
In excel this is possible with the intercept and slope functions and then dragging across columns. I need something in R that would basically do the same, I could of course run separate regressions , but the requirement is that I need to run all of them in one loop and store estimates of intercept and slopes together for each.
Im still learning R and any help on this would be great. Thanks :)
The lmList function in package nlme was designed for this.
Let's use the iris dataset as an example:
DF <- iris[, 1:4]
# Sepal.Length Sepal.Width Petal.Length Petal.Width
#1 5.1 3.5 1.4 0.2
#2 4.9 3.0 1.4 0.2
#3 4.7 3.2 1.3 0.2
#4 4.6 3.1 1.5 0.2
#5 5.0 3.6 1.4 0.2
#6 5.4 3.9 1.7 0.4
#...
First we have to reshape it. We want Sepal.Length as the dependent and the other columns as predictors in this example.
library(reshape2)
DF <- melt(DF, id.vars = "Sepal.Length")
# Sepal.Length variable value
#1 5.1 Sepal.Width 3.5
#2 4.9 Sepal.Width 3.0
#3 4.7 Sepal.Width 3.2
#4 4.6 Sepal.Width 3.1
#5 5.0 Sepal.Width 3.6
#6 5.4 Sepal.Width 3.9
#...
Now we can do the fits.
library(nlme)
mods <- lmList(Sepal.Length ~ value | variable,
data = DF, pool = FALSE)
We can now extract intercept and slope for each model.
coef(mods)
# (Intercept) value
#Sepal.Width 6.526223 -0.2233611
#Petal.Length 4.306603 0.4089223
#Petal.Width 4.777629 0.8885803
And get the usual t-table:
summary(mods)
# Call:
# Model: Sepal.Length ~ value | variable
# Data: DF
#
# Coefficients:
# (Intercept)
# Estimate Std. Error t value Pr(>|t|)
# Sepal.Width 6.526223 0.47889634 13.62763 6.469702e-28
# Petal.Length 4.306603 0.07838896 54.93890 2.426713e-100
# Petal.Width 4.777629 0.07293476 65.50552 3.340431e-111
# value
# Estimate Std. Error t value Pr(>|t|)
# Sepal.Width -0.2233611 0.15508093 -1.440287 1.518983e-01
# Petal.Length 0.4089223 0.01889134 21.646019 1.038667e-47
# Petal.Width 0.8885803 0.05137355 17.296454 2.325498e-37
Or the R-squared values:
summary(mods)$r.squared
#[1] 0.01382265 0.75995465 0.66902769
However, if you need something more efficient, you can use package data.table together with lm's workhorse lm.fit:
library(data.table)
setDT(DF)
DF[, setNames(as.list(lm.fit(cbind(1, value),
Sepal.Length)[["coefficients"]]),
c("intercept", "slope")), by = variable]
# variable intercept slope
#1: Sepal.Width 6.526223 -0.2233611
#2: Petal.Length 4.306603 0.4089223
#3: Petal.Width 4.777629 0.8885803
And of course the R.squared values of these models are just the squared Pearson correlation coefficients:
DF[, .(r.sq = cor(Sepal.Length, value)^2), by = variable]
# variable r.sq
#1: Sepal.Width 0.01382265
#2: Petal.Length 0.75995465
#3: Petal.Width 0.66902769
Related
how do I calculate the mean square of all 2019_Preston_STD,2019_Preston_V1,2019_Preston_V2 etc using the Value column, then the adjmth1, adjmth3 columns
structure(list(IDX = c("2019_Preston_STD", "2019_Preston_V1",
"2019_Preston_V2", "2019_Preston_V3", "2019_Preston_W1", "2019_Preston_W2"
), Value = c(3L, 2L, 3L, 2L, 3L, 5L), adjmth1 = c(2.87777777777778,
1.85555555555556, 2.01111111111111, 1.77777777777778, 3.62222222222222,
4.45555555555556), adjmth3 = c(2.9328763348507, 2.08651828334684,
2.80282946626847, 2.15028039284054, 2.68766916156347, 4.51425274916654
), adjmth13 = c(2.81065411262847, 1.82585524933201, 1.81394057737959,
1.40785681078568, 3.30989138378569, 4.7301083495049)), row.names = 29:34, class = "data.frame")
This task can be done in many ways, as shown in the link that #r2evans pointed out. My favorite one is dplyr using summarize(across() because to me its syntax is easy to understand and easy to apply to many columns. It also presents the resulted numbers in nice format.
For example, from iris data I want to get the arithmetic mean of Sepal.Length, Petal.Length, and Petal.Width for each of species : setosa, versicolor, and virginica. Here is the head of the data:
head(iris)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5.0 3.6 1.4 0.2 setosa
# 6 5.4 3.9 1.7 0.4 setosa
And here is how to get the mean in each species:
iris %>% group_by(Species) %>%
summarize(across(c(Sepal.Length, Petal.Length, Petal.Width), mean))
# A tibble: 3 x 4
# Species Sepal.Length Petal.Length Petal.Width
# <fct> <dbl> <dbl> <dbl>
# 1 setosa 5.01 1.46 0.246
# 2 versicolor 5.94 4.26 1.33
# 3 virginica 6.59 5.55 2.03
As for your task, first you need to define the function for the mean square (because its definition slightly varies in some references). Then, you apply it to your data frame using summarize(across()).
For example, you define the mean square function as follows:
meansq <- function(x) sum((x-mean(x))^2)/(length(x)-1)
Note: This definition requires that length(x) doesn't equal 1, or otherwise NaN will be produced.
You can apply it to your data frame newdata as follows:
newdata %>% group_by(IDX) %>%
summarize(across(c(Value, adjmth1, adjmth3), meansq)
This question is the second part of a previous question (Linear Regression prediction in R using Leave One out Approach).
I'm trying to build models for each country and generate linear regression predictions using the leave one out approach. In other words, in the code below when building model1 and model2 the "data" used should not be the entire data set. Instead it should be a subset of the dataset (country). Each country data should be evaluated using a model built with data specific to that country.
The code below returns an error. How can I modify/fix the code below to do that? Or is there a better way of doing that?
library(modelr)
install.packages("gapminder")
library(gapminder)
data(gapminder)
#CASE 1
model1 <- lm(lifeExp ~ pop, data = gapminder, subset = country)
model2 <- lm(lifeExp ~ pop + gdpPercap, data = gapminder, subset = country)
models <- list(fit_model1 = model1,fit_model2 = model2)
gapminder %>% nest_by(continent, country) %>%
bind_cols(
map(1:nrow(gapminder), function(i) {
map_dfc(models, function(model) {
training <- data[-i, ]
fit <- lm(model, data = training)
validation <- data[i, ]
predict(fit, newdata = validation)
})
}) %>%
bind_rows()
)
The most succinct and straightforward solution would be a nested for loop approach, where the outer loop is the two model formulae and the inner loop is the unity we want to leave out. This can also be done with outer, which I also show afterwards.
For sake of clarity I first show how to leave out one observation (i.e. one row) in each iteration (Part I). I show later how to leave out one cluster (e.g. country) (Part II). I also use the built-in iris data set, which is smaller and thus easier to handle. It contains a "Species" column that is meant to correspond to the "countries" in your data.
Part I
First, we put the two formulae into a list and name them as we would like them to appear in the resulting columns later.
FOAE <- list(fit1=Petal.Length ~ Sepal.Length,
fit2=Petal.Length ~ Sepal.Length + Petal.Width)
For the loop, we want to initialize a matrix im whose rows correspond to the number of rows we want to leave out, and columns to the number of model formulae.
im <- matrix(NA, nrow=nrow(iris), ncol=length(FOAE),
dimnames=list(NULL, names(FOAE)))
This would look like this:
head(im, n=3)
# fit1 fit2
# [1,] NA NA
# [2,] NA NA
# [3,] NA NA
Now we loop over formulas and rows as described above.
for (i in seq(FOAE)) {
for(j in seq(nrow(iris))) {
train <- iris[-j,]
test <- iris[j,]
fit <- lm(FOAE[[i]], data=train)
im[j, i] <- predict(fit, newdata=test)
}
}
im has now been filled, and we may cbind it to the original iris data set to get our result res1.
res1 <- cbind(iris, im)
head(res1)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species fit1 fit2
# 1 5.1 3.5 1.4 0.2 setosa 2.388501 1.611976
# 2 4.9 3.0 1.4 0.2 setosa 2.014324 1.501389
# 3 4.7 3.2 1.3 0.2 setosa 1.639805 1.392955
# 4 4.6 3.1 1.5 0.2 setosa 1.446175 1.333199
# 5 5.0 3.6 1.4 0.2 setosa 2.201646 1.556620
# 6 5.4 3.9 1.7 0.4 setosa 2.944788 2.127184
To alternatively follow the outer approach, we put the code inside the for loop into a formula which we Vectorize so that it can handle matrix columns (i.e. vectors).
FUN1 <- Vectorize(function(x, y) {
train <- iris[-x,]
test <- iris[x,]
fit <- lm(y, data=train)
predict(fit, newdata=test)
})
Now we put FOAE and the rows 1:nrow(iris) to leave out subsequently, together with FUN1 into outer(). This already gives us the result that we can cbind to iris in the same way as above to get our result res2.
o1 <- outer(FOAE, 1:nrow(iris), FUN1)
res2 <- cbind(iris, o1)
head(res2)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species fit1 fit2
# 1 5.1 3.5 1.4 0.2 setosa 2.388501 1.611976
# 2 4.9 3.0 1.4 0.2 setosa 2.014324 1.501389
# 3 4.7 3.2 1.3 0.2 setosa 1.639805 1.392955
# 4 4.6 3.1 1.5 0.2 setosa 1.446175 1.333199
# 5 5.0 3.6 1.4 0.2 setosa 2.201646 1.556620
# 6 5.4 3.9 1.7 0.4 setosa 2.944788 2.127184
## test if results are different is negative
stopifnot(all.equal(res1, res2))
Part II
We may follow a similar approach when leaving out a cluster (i.e. species or countries). I show here the outer method. The thing we want to change is that we now want to leave out observations belonging to a specific cluster, here "Species" (in your case "countries"), which unique values we put into a vector Species.u . Since the values are in "character" or "factor" format we subset the data using data[!data$cluster %in% x, ] instead of data[-x, ]. Because predict would yield multiple values in the clusters, but we want the same value in the respective clusters, we might want to use a statistic, e.g. the mean prediction of each cluster. We use rownames according to the cluster.
FUN2 <- Vectorize(function(x, y) {
train <- iris[!iris$Species %in% x,]
test <- iris[iris$Species %in% x,]
fit <- lm(y, data=train)
mean(predict(fit, newdata=test))
})
Species.u <- unique(iris$Species)
o2 <- `rownames<-`(outer(Species.u, FOAE, FUN2), Species.u)
This now gives us a matrix which is smaller than our data set. Thanks to the rownames we may match the predictions tho the clusters to which they belong.
o2
# fit1 fit2
# setosa 3.609943 2.662609
# versicolor 3.785760 3.909919
# virginica 4.911009 5.976922
res3 <- cbind(iris, o2[match(iris$Species, rownames(o2)), ])
head(res3)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species fit1 fit2
# setosa 5.1 3.5 1.4 0.2 setosa 3.609943 2.662609
# setosa.1 4.9 3.0 1.4 0.2 setosa 3.609943 2.662609
# setosa.2 4.7 3.2 1.3 0.2 setosa 3.609943 2.662609
# setosa.3 4.6 3.1 1.5 0.2 setosa 3.609943 2.662609
# setosa.4 5.0 3.6 1.4 0.2 setosa 3.609943 2.662609
# setosa.5 5.4 3.9 1.7 0.4 setosa 3.609943 2.662609
tail(res3)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species fit1 fit2
# virginica.44 6.7 3.3 5.7 2.5 virginica 4.911009 5.976922
# virginica.45 6.7 3.0 5.2 2.3 virginica 4.911009 5.976922
# virginica.46 6.3 2.5 5.0 1.9 virginica 4.911009 5.976922
# virginica.47 6.5 3.0 5.2 2.0 virginica 4.911009 5.976922
# virginica.48 6.2 3.4 5.4 2.3 virginica 4.911009 5.976922
# virginica.49 5.9 3.0 5.1 1.8 virginica 4.911009 5.976922
Edit
In this version of FUN2, FUN3, the output of the models of each cluster are rbinded (in two columns of course, because of two models).
FUN3 <- Vectorize(function(x, y) {
train <- iris[!iris$Species %in% x,]
test <- iris[iris$Species %in% x,]
fit <- lm(y, data=train)
(predict(fit, newdata=test))
}, SIMPLIFY=F)
Species.u <- unique(iris$Species)
o3 <- `rownames<-`(outer(Species.u, FOAE, FUN3), Species.u)
res32 <- cbind(iris, apply(o3, 2, unlist))
head(res32)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species fit1 fit2
# setosa.1 5.1 3.5 1.4 0.2 setosa 3.706940 2.678255
# setosa.2 4.9 3.0 1.4 0.2 setosa 3.500562 2.547587
# setosa.3 4.7 3.2 1.3 0.2 setosa 3.294183 2.416919
# setosa.4 4.6 3.1 1.5 0.2 setosa 3.190994 2.351586
# setosa.5 5.0 3.6 1.4 0.2 setosa 3.603751 2.612921
# setosa.6 5.4 3.9 1.7 0.4 setosa 4.016508 3.073249
Edit 2
As I learned in your comment you want 1. a subset of your data along clusters. This would be ss in FUN4 below. Then the ss is also subsetted by leaving out one row z over the rows of subset ss.
FUN4 <- Vectorize(function(x, y) {
## subsets first by cluster then by row
ss <- iris[iris$Species %in% x,] ## cluster subset
sapply(1:nrow(ss), function(z) { ## subset rows using `sapply`
train <- ss[-z,] ## train data w/o row z
test <- ss[z,] ## test data for `predict`, just row z
fit <- lm(y, data=train)
predict(fit, newdata=test)
})
}, SIMPLIFY=F)
## the two models
FOAE <- list(fit1=Petal.Length ~ Sepal.Length,
fit2=Petal.Length ~ Sepal.Length + Petal.Width)
## unique cluster names
Species.u <- unique(iris$Species)
## with the `outer` we iterate over all the permutations of clusters and models `FOAE`.
o4 <- `rownames<-`(outer(Species.u, FOAE, FUN4), Species.u)
## `unlist`ed result is directly `cbind`able to original data
res4 <- cbind(iris, apply(o4, 2, unlist))
## result
head(res4)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species fit1 fit2
# setosa.1 5.1 3.5 1.4 0.2 setosa 1.476004 1.451029
# setosa.2 4.9 3.0 1.4 0.2 setosa 1.449120 1.431737
# setosa.3 4.7 3.2 1.3 0.2 setosa 1.426185 1.416492
# setosa.4 4.6 3.1 1.5 0.2 setosa 1.404040 1.398103
# setosa.5 5.0 3.6 1.4 0.2 setosa 1.462460 1.441295
# setosa.6 5.4 3.9 1.7 0.4 setosa 1.504990 1.559045
I want to use dplyr to group a data.frame, fit linear regressions and save the residuals as a column in the original, ungrouped data.frame.
Here's an example
> iris %>%
select(Sepal.Length, Sepal.Width) %>%
group_by(Species) %>%
do(mod = lm(Sepal.Length ~ Sepal.Width, data=.)) %>%
Returns:
Species mod
1 setosa <S3:lm>
2 versicolor <S3:lm>
3 virginica <S3:lm>
Instead, I would like the original data.frame with a new column containing residuals.
For example,
Sepal.Length Sepal.Width resid
1 5.1 3.5 0.04428474
2 4.9 3.0 0.18952960
3 4.7 3.2 -0.14856834
4 4.6 3.1 -0.17951937
5 5.0 3.6 -0.12476423
6 5.4 3.9 0.06808885
I adapted an example from http://jimhester.github.io/plyrToDplyr/.
r <- iris %>%
group_by(Species) %>%
do(model = lm(Sepal.Length ~ Sepal.Width, data=.)) %>%
do((function(mod) {
data.frame(resid = residuals(mod$model))
})(.))
corrected <- cbind(iris, r)
update Another method is to use the augment function in the broom package:
r <- iris %>%
group_by(Species) %>%
do(augment(lm(Sepal.Length ~ Sepal.Width, data=.))
Which returns:
Source: local data frame [150 x 10]
Groups: Species
Species Sepal.Length Sepal.Width .fitted .se.fit .resid .hat
1 setosa 5.1 3.5 5.055715 0.03435031 0.04428474 0.02073628
2 setosa 4.9 3.0 4.710470 0.05117134 0.18952960 0.04601750
3 setosa 4.7 3.2 4.848568 0.03947370 -0.14856834 0.02738325
4 setosa 4.6 3.1 4.779519 0.04480537 -0.17951937 0.03528008
5 setosa 5.0 3.6 5.124764 0.03710984 -0.12476423 0.02420180
...
A solution that seems to be easier than the ones proposed so far and closer to the code of the original question is :
iris %>%
group_by(Species) %>%
do(data.frame(., resid = residuals(lm(Sepal.Length ~ Sepal.Width, data=.))))
Result :
# A tibble: 150 x 6
# Groups: Species [3]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species resid
<dbl> <dbl> <dbl> <dbl> <fct> <dbl>
1 5.1 3.5 1.4 0.2 setosa 0.0443
2 4.9 3 1.4 0.2 setosa 0.190
3 4.7 3.2 1.3 0.2 setosa -0.149
4 4.6 3.1 1.5 0.2 setosa -0.180
5 5 3.6 1.4 0.2 setosa -0.125
6 5.4 3.9 1.7 0.4 setosa 0.0681
7 4.6 3.4 1.4 0.3 setosa -0.387
8 5 3.4 1.5 0.2 setosa 0.0133
9 4.4 2.9 1.4 0.2 setosa -0.241
10 4.9 3.1 1.5 0.1 setosa 0.120
Since you are be running the exact same regression for each group, you might find it simpler to just define your regression model as a function() beforehand, and then execute it for each group using mutate.
model<- function(y,x){
a<- y + x
if( length(which(!is.na(a))) <= 2 ){
return( rep(NA, length(a)))
} else {
m<- lm( y ~ x, na.action = na.exclude)
return( residuals(m))
}
}
Note, that the first part of this function is to insure against any error messages popping up in case your regression is run on a group with less than zero degrees of freedom (This might be the case if you have a dataframe with several grouping variables with many levels , or numerous independent variables for your regression (like for example lm(y~ x1 + x2)), and can't afford to inspect each of them for sufficient non-NA observations).
So your example can be rewritten as follows:
iris %>% group_by(Species) %>%
mutate(resid = model(Sepal.Length,Sepal.Width) ) %>%
select(Sepal.Length,Sepal.Width,resid)
Which should yield:
Species Sepal.Length Sepal.Width resid
<fctr> <dbl> <dbl> <dbl>
1 setosa 5.1 3.5 0.04428474
2 setosa 4.9 3.0 0.18952960
3 setosa 4.7 3.2 -0.14856834
4 setosa 4.6 3.1 -0.17951937
5 setosa 5.0 3.6 -0.12476423
6 setosa 5.4 3.9 0.06808885
This method should not be computationally much different from the one using augment().(I've had to use both methods on data sets containing several hundred million observations, and believe there was no significant difference in terms of speed compared to using the do() function).
Also, please note that omitting na.action = na.exclude, or using m$residuals instead of residuals(m), will result in the exclusion of rows that have NAs (dropped prior to estimation) from the output vector of residuals. The corresponding vector will thus not have sufficient length() in order to be merged with the data set, and some error message might appear.
iris <- read.csv("iris.csv") #iris data available in R
library(rpart)
iris.rpart <- rpart(Species~Sepal.length+Sepal.width+Petal.width+Petal.length,
data=iris)
plotcp(iris.rpart)
printcp(iris.rpart)
iris.rpart1 <- prune(iris.rpart, cp=0.047)
plot(iris.rpart1,uniform=TRUE)
text(iris.rpart1, use.n=TRUE, cex=0.6)
I have tried to get the rpart done on the iris data. However, is it possible by using some function in R to get the rules applied by rpart for this current tree preparation so that we know how the classifications are made when we add further new points to the data set?
The
rpart.plot
package has a function
rpart.rules for generating a set of rules for a tree. For example
library(rpart.plot)
iris.rpart <- rpart(Species~., data=iris)
rpart.rules(iris.rpart)
gives
Species seto vers virg
setosa [1.00 .00 .00] when Petal.Length < 2.5
versicolor [ .00 .91 .09] when Petal.Length >= 2.5 & Petal.Width < 1.8
virginica [ .00 .02 .98] when Petal.Length >= 2.5 & Petal.Width >= 1.8
And
options(width=1000)
rpart.predict(iris.rpart, newdata=iris[50:52,], rules=TRUE)
gives you the rule used to make each prediction:
setosa versicolor virginica
50 1 0.00000 0.000000 because Petal.Length < 2.5
51 0 0.90741 0.092593 because Petal.Length >= 2.5 & Petal.Width < 1.8
52 0 0.90741 0.092593 because Petal.Length >= 2.5 & Petal.Width < 1.8
For more examples see Chapter 4 of the
rpart.plot vignette.
I'm completely new to R and would like to turn each column label (header?) in my data set into a regressor without having to define each regressor one at a time, i.e. date -> data$Date
Is there a way to do this all at once?
Thank you in advance!
Is this what you want:
R > data(iris)
R > head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
R > lm(Sepal.Length ~ ., data = iris)
Call:
lm(formula = Sepal.Length ~ ., data = iris)
Coefficients:
(Intercept) Sepal.Width Petal.Length Petal.Width
2.1713 0.4959 0.8292 -0.3152
Speciesversicolor Speciesvirginica
-0.7236 -1.0235
If you want to choose specific column you can use this:
data is sample.data with dependent variable in col 3 and cols 1, 2, 4:8 are independent variables
yy<-lm(as.formula(paste(colnames(sample.data)[3], "~",paste(colnames(sample.data)[c(1, 2,4:8)], collapse = "+"), sep = "")), data=sample.data)
)
summary(yy)