Running several linear regressions from a single dataframe in R - r

I have a dataset of export trade data for a single country with 21 columns. The first column indicates the years (1962-2014) while the other 20 are trading partners. I am trying to run linear regressions of each other column on the years column. I have tried the method recommended here: Running multiple, simple linear regressions from dataframe in R, which entails using
combn(names(DF), 2, function(x){lm(DF[, x])}, simplify = FALSE)
However, this only yields the intercept for each pair, which is less important to me than the slope of each regression.
Additionally I have tried to use my dataset as a time series, however when I try to run
lm(dimnames~., brazilts, na.action=na.exclude)
(where brazilts is my dataset as a time series from "1962" to "2014") it returns:
Error in model.frame.default(formula = dimnames ~ ., data = brazilts, :
object is not a matrix.
I therefore tried the same method with a matrix but then it returned the error:
Error in model.frame.default(formula = . ~ YEAR, data = brazilmatrix, :
'data' must be a data.frame, not a matrix or an array
(where brazilmatrix is my dataset as a data.matrix which includes a column for years).
Admittedly, I am not very proficient in R at this point. The ultimate goal is to create a loop that I can use to run regressions for a significantly larger dataset of gross exports by country-pair per year for 28 countries. Perhaps I am attacking this in entirely the wrong way, so any help or criticism is welcome. Bear in mind that the years (1962-2014) are in effect my explanatory variable and the value of gross exports is my dependent variable, which may be throwing off my syntax in the above examples. Thanks in advance!

Just to add an alternative, I would propose going down this route:
library(reshape2)
library(dplyr)
library(broom)
df <- melt(data.frame(x = 1962:2014,
                      y1 = rnorm(53),
                      y2 = rnorm(53),
                      y3 = rnorm(53)),
           id.vars = "x")
df %>% group_by(variable) %>% do(tidy(lm(value ~ x, data=.)))
Here, I just melt the data so that all relevant columns are given by groups of rows, to be able to use dplyr's grouped actions. This gives the following dataframe as output:
Source: local data frame [6 x 6]
Groups: variable [3]

  variable        term     estimate    std.error  statistic   p.value
    (fctr)       (chr)        (dbl)        (dbl)      (dbl)     (dbl)
1       y1 (Intercept) -3.646666114 18.988154862 -0.1920495 0.8484661
2       y1           x  0.001891627  0.009551103  0.1980533 0.8437907
3       y2 (Intercept) -8.939784046 16.206935047 -0.5516024 0.5836297
4       y2           x  0.004545156  0.008152140  0.5575415 0.5795966
5       y3 (Intercept) 21.699503502 16.785586452  1.2927462 0.2019249
6       y3           x -0.010879271  0.008443204 -1.2885240 0.2033785
This is a pretty convenient form to continue working with the coefficients. All that is required is to melt the dataframe so that all columns are rows in the dataset, and then to use dplyr's group_by to carry out the regression in all subsets. broom::tidy puts the regression output into a nice dataframe. See ?broom for more information.
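If only the slopes are of interest, as in the question, the tidied output can be filtered down to the slope rows; a minimal sketch reusing df from above:
df %>%
  group_by(variable) %>%
  do(tidy(lm(value ~ x, data = .))) %>%
  filter(term == "x")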
In case you need to keep the models to do adjustments of some sort (which are implemented for lm objects), then you can also do the following:
df %>% group_by(variable) %>% do(mod = lm(value ~ x, data=.))
Source: local data frame [3 x 2]
Groups: <by row>

# A tibble: 3 x 2
  variable      mod
*   <fctr>   <list>
1       y1 <S3: lm>
2       y2 <S3: lm>
3       y3 <S3: lm>
Here, for each variable, the lm object is stored in the dataframe. So, if you want to get the model output for the first, you can just access it as you would access any normal dataframe, e.g.
tmp <- df %>% group_by(variable) %>% do(mod = lm(value ~ x, data=.))
tmp[tmp$variable == "y1",]$mod
[[1]]
Call:
lm(formula = value ~ x, data = .)
Coefficients:
(Intercept) x
-1.807255 0.001019
This is convenient if you want to apply some methods to all lm objects since you can use the fact that tmp$mod gives you a list of them, which makes it easy to pass to e.g. lapply.
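For example, a one-line sketch that pulls just the slope out of every stored model in tmp:
sapply(tmp$mod, function(m) coef(m)["x"])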

Quite aside from the statistical justification for doing this, the programming problem is an interesting one. Here is a solution, but probably not the most elegant one. First, create a sample data set:
x = c(1962:2014)
y1 = c(rnorm(53))
y2 = c(rnorm(53))
y3 = c(rnorm(53))
mydata = data.frame(x, y1, y2, y3)
attach(mydata)
head(mydata)
# x y1 y2 y3
#1 1962 -0.9884054 -1.68208217 0.5980446
#2 1963 -1.0741098 0.51309753 1.0986366
#3 1964 0.1357549 -0.23427820 0.1482258
#4 1965 -0.8846920 -0.60375400 0.7162992
#5 1966 -0.5529187 0.85573739 0.5541827
#6 1967 0.4881922 -0.09360152 -0.5379037
Next, use a for loop to do several regressions:
for(i in 2:4){
  reg = lm(x ~ mydata[,i])
  print(reg)
}
Call:
lm(formula = x ~ mydata[, i])
Coefficients:
(Intercept) mydata[, i]
1988.0088 -0.1341
Call:
lm(formula = x ~ mydata[, i])
Coefficients:
(Intercept) mydata[, i]
1987.87 2.07
Call:
lm(formula = x ~ mydata[, i])
Coefficients:
(Intercept) mydata[, i]
1987.304 -4.101
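Note that this loop regresses the year on each column. Since the question treats the year as the explanatory variable, a variant that flips the formula and collects just the slopes might look like the following sketch (reusing x and mydata from above):
slopes <- numeric(0)
for (i in 2:4) {
  fit <- lm(mydata[, i] ~ x)                 # year as the explanatory variable
  slopes[names(mydata)[i]] <- coef(fit)[2]   # keep the slope, named by column
}
slopes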

Related

Save R-squared from lm summary as a dataframe

I want to save the result of an lm model into a dataframe. I generated an empty dataframe (Startframe), where I want to save the results.
My dataframe containing the data is called testdata in this case. It contains the Date in the first column and then several Stations in the rest of the columns.
So far, this code works to get the Estimate, Std. Error, t value and Pr(>|t|):
for(i in 2:ncol(testdata)) {
  x <- testdata[,1]
  y <- testdata[,i]
  mod <- lm(y ~ x)
  summary(mod)
  Startframe[i,] <- c(i,
                      summary(mod)[['coefficients']]['(Intercept)','Estimate'],
                      summary(mod)[['coefficients']]['x','Estimate'],
                      summary(mod)[['coefficients']]['x','Std. Error'],
                      summary(mod)[['coefficients']]['x','t value'],
                      summary(mod)[['coefficients']]['x','Pr(>|t|)'])
}
But how can I also extract the r.squared?
I tried to add summary(mod)[['r.squared']] to the list, but it gives me the wrong numbers.
I know str(summary(mod)) gives me an overview, but I can't figure out how to add it into my loop.
Thanks for your help.
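For what it's worth, summary(mod)[['r.squared']] is the right accessor; "wrong numbers" in this pattern usually mean that Startframe does not have a matching number of columns, so the assigned vector gets recycled. A minimal sketch, assuming Startframe was created with seven columns:
for(i in 2:ncol(testdata)) {
  x <- testdata[,1]
  y <- testdata[,i]
  mod <- lm(y ~ x)
  Startframe[i,] <- c(i,
                      summary(mod)[['coefficients']]['(Intercept)','Estimate'],
                      summary(mod)[['coefficients']]['x','Estimate'],
                      summary(mod)[['coefficients']]['x','Std. Error'],
                      summary(mod)[['coefficients']]['x','t value'],
                      summary(mod)[['coefficients']]['x','Pr(>|t|)'],
                      summary(mod)[['r.squared']])  # seventh column: R-squared
}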
A nice way to work with the same model on different subsets of the data is the tidyverse approach using the broom package.
In this example I'm using the diamonds dataset to test how carat and depth affect the diamonds' price within different diamond cuts.
require(tidyverse)
require(broom)
diamonds %>%
  nest(-cut) %>%
  mutate(model = purrr::map(data, function(x) {
           lm(price ~ carat + depth, data = x)
         }),
         values = purrr::map(model, glance),
         r.squared = purrr::map_dbl(values, "r.squared"),
         pvalue = purrr::map_dbl(values, "p.value")) %>%
  select(-data, -model, -values)
cut r.squared pvalue
<ord> <dbl> <dbl>
1 Ideal 0.867 0
2 Premium 0.856 0
3 Good 0.851 0
4 Very Good 0.859 0
5 Fair 0.746 0
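If the coefficient estimates are needed as well, broom's tidy can be mapped in the same way; a sketch along the same lines:
diamonds %>%
  nest(-cut) %>%
  mutate(model = purrr::map(data, function(x) lm(price ~ carat + depth, data = x)),
         coefs = purrr::map(model, tidy)) %>%
  select(cut, coefs) %>%
  unnest(coefs)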

Hundreds of linear regressions that run by group in R [closed]

I have a table with 3,000+ rows and 10+ variables. I am trying to run a linear regression using one variable as the predictor and another as the response for 300 different groups. I need the slope, p-value, and r-squared for each of these regressions. To do each regression individually and record the summary variables would take hours if not days.
I have used the following package to get the intercept and slope for each group, but I do not know how to also get the corresponding p-value and r-squared for each group:
library(lme4)
groupreg<-lmList(logpop ~ avgp | id, data=data)
groupreg
A sample of the resulting list is below, where "Adams #" is the id value. NAs appear because not all groups have enough points to fit a line:
Coefficients:
(Intercept) avgp
Adams 6 4.0073332 NA
Adams 7 6.5177389 -7.342443e+00
Adams 8 4.7449321 NA
Adams 9 NA NA
This table does not include any significance statistics, however; I still need the p-value and r-squared statistic. If there is code to do it all in one go for all group values, or code to just pull the remaining values, it would be helpful.
Is there also a way to exponentiate the slope output for all groups? My outcome was log-transformed.
Thank you all!!
I think the easiest answer is still missing. You can use a combination of nesting and mapping. I'll show how it works for linear regression; you should be able to apply the same principle to models from the lme4 package.
Let's create a toy data set in which we've measured the IQ score of three different groups at two points in time.
library(tidyverse)
library(broom)
df <- tibble(
  id = seq_len(90),
  IQ = rnorm(90, 100, 15),
  group = rep(c("A", "B", "C"), each = 30),
  time = rep(c("T1", "T2"), 45)
)
If we want to build a regression model for each group, investigating the relation between the IQ score and the point of time, we only need five lines of code.
df %>%
  nest(-group) %>%
  mutate(fit = map(data, ~ lm(IQ ~ time, data = .)),
         results = map(fit, glance)) %>%
  unnest(results) %>%
  select(group, r.squared, p.value)
Which will return
# A tibble: 3 x 3
group r.squared p.value
<chr> <dbl> <dbl>
1 A 0.0141 0.532
2 B 0.0681 0.164
3 C 0.00432 0.730
where nest(-group) creates tibbles within your tibble for each group, containing the corresponding variables of id, IQ and time. Then you add a new column fit with mutate() where you apply a regression model for each group and a new column containing the results, which we unnest() shortly after to access the values glance() returned properly. In the last step we select() the three values of interest.
To get the slope you need to call tidy() in addition. Maybe it's possible to shorten the code somehow, but one solution would be
df %>%
  nest(-group) %>%
  mutate(fit = map(data, ~ lm(IQ ~ time, data = .)),
         results1 = map(fit, glance),
         results2 = map(fit, tidy)) %>%
  unnest(results1) %>%
  unnest(results2) %>%
  select(group, term, estimate, r.squared, p.value) %>%
  mutate(estimate = exp(estimate))
To exponentiate the slope, you can just add another mutate() statement. Finally it returns
# A tibble: 6 x 5
group term estimate r.squared p.value
<chr> <chr> <dbl> <dbl> <dbl>
1 A (Intercept) 3.34e+46 0.0141 0.532
2 A timeT2 3.31e- 2 0.0141 0.532
3 B (Intercept) 1.17e+47 0.0681 0.164
4 B timeT2 1.34e- 3 0.0681 0.164
5 C (Intercept) 8.68e+43 0.00432 0.730
6 C timeT2 1.25e- 1 0.00432 0.730
Note that the estimates are exponentiated already. Without the exponentiation you can double check the slope and p value with base R calling
summary(lm(IQ ~ time, data = filter(df, group == "A")))
If you work with more complex models (lme4), there is a package called lmerTest which offers wrapper functions for lme4 that return p-values (at least for mixed models, which I have worked with).
A word of warning about using glance() on lme4 models: the broom maintainers are trying out a new concept in which the summary statistics are outsourced to the package developer responsible for each model class.
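As a purely illustrative sketch (not something tested in the answer above), the original question's data could also be fit as one mixed model; lmerTest masks lme4::lmer, and its summary gains a Pr(>|t|) column (logpop, avgp and id are the names from the question):
library(lmerTest)
fit <- lmer(logpop ~ avgp + (avgp | id), data = data)
summary(fit)$coefficients  # fixed effects now include a Pr(>|t|) column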
If I am understanding your question correctly, you want to run multiple regressions over lots of groups. Here is an example of how to do so with the mtcars data.
library(dplyr)
mtcars %>% group_by(cyl) %>%
  summarise_at(vars(disp:wt), funs(
    r.sqr = summary(lm(mpg~.))$r.squared,
    intercept = summary(lm(mpg~.))$coefficients[[1]],
    slope = summary(lm(mpg~.))$coefficients[[2]],
    p.value = summary(lm(mpg~.))$coefficients[[8]]
  ))
This will run a regression per group per variable and extract the info you asked for. If your formula is always the same, you could simplify as follows.
mtcars %>% group_by(cyl) %>%
  summarise(
    r.sqr = summary(lm(mpg~wt))$r.squared,
    intercept = summary(lm(mpg~wt))$coefficients[[1]],
    slope = summary(lm(mpg~wt))$coefficients[[2]],
    p.value = summary(lm(mpg~wt))$coefficients[[8]]
  )
This is actually running the regression 4 times (once per value of interest). If that takes too long for your real data, you could try this:
df <- mtcars %>% group_by(cyl) %>% summarise(model = list(summary(lm(mpg~wt))))
which simply runs the model once per group and then extracts the info you want afterwards. The problem is that extracting values this way can be a pain:
df$model[[1]]$coefficients[[1]]
[1] 39.5712
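One way around the pain is to loop over the stored summaries once and assemble a data frame; a small sketch using the df defined just above:
data.frame(cyl       = df$cyl,
           r.sqr     = sapply(df$model, function(s) s$r.squared),
           intercept = sapply(df$model, function(s) s$coefficients[[1]]),
           slope     = sapply(df$model, function(s) s$coefficients[[2]]),
           p.value   = sapply(df$model, function(s) s$coefficients[[8]]))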
While the code given by AndS will work, it runs the lm function 4 times for each group, which makes it a bit inefficient. You can use the following instead; I have tried to break it into simpler steps.
Assuming your data frame (df) has three variables: "Group", "Dep", "Indep":
#Getting the unique list of groups
groups <- unique(df$Group)
#Creating a list to collect the model summary of each model
model_summaries <- list()
#Running the models
for(i in 1:length(groups)){
  model <- lm(Dep ~ Indep, df[df$Group == groups[i], c("Dep", "Indep")])
  model_summaries[[i]] <- summary(model)
}
Each model summary contains, among other elements, r.squared and coefficients (which include the p-values and the intercept too).
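For example, they can be collected into one data frame like this (a sketch reusing groups and model_summaries from above):
results <- data.frame(
  group     = groups,
  slope     = sapply(model_summaries, function(s) s$coefficients["Indep", "Estimate"]),
  p.value   = sapply(model_summaries, function(s) s$coefficients["Indep", "Pr(>|t|)"]),
  r.squared = sapply(model_summaries, function(s) s$r.squared)
)
results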
Let me know if this helps.

How can I make all-possible-regressions in R also include exponents and logs of the variables?

I primarily work in research and statistics and so am not as familiar with programming. I'm using the OLSRR package for statistical analysis when trying to compare as many model specifications as possible using all-possible-regressions.
I use the code:
model <- lm(y ~ ., data = mydata)
k <- ols_all_subset(model)
k
So far, this gives me a table with R2, adjusted R2, AIC, SIC etc. for each combination of the variables as-is (linear). For example, if the variables are x1, x2 and x3, it gives me a table with R2, AIC, SIC etc. for every possible linear specification: x1, x2 and x3 as regressors; x1 and x2; x1 and x3; x2 and x3; and each of just x1, just x2, and just x3.
I also want all-possibles for squares and logs of the variables, to look at every major specification, so I get not just those variables but also x1^2, log(x1), log(x3), and so on. How should I modify what I'm doing so that the output table also includes possible power (squared) and logarithmic specifications?
I know I could individually create a new column and generate each of x1^2, log(x1), etc. as a new column, but sometimes I have dozens of variables and tons of data, so doing every single variable individually for each new dataset is a pain.
Here is a suggestion on how to automate the process:
You don't provide any sample data, so I'm going to use mtcars:
df <- mtcars[, c(1,5:7)];
Here, the response is column 1, and I consider 3 predictor variables in columns 5-7.
The goal is to automatically build all relevant terms and use those to construct a formula, which we then use in lm.
Build all predictor terms:
# All but the first column are predictors
terms <- sapply(colnames(df)[-1], function(x)
c(x, sprintf("I(%s^2)", x), sprintf("log(%s)", x)));
terms;
# drat wt qsec
#[1,] "drat" "wt" "qsec"
#[2,] "I(drat^2)" "I(wt^2)" "I(qsec^2)"
#[3,] "log(drat)" "log(wt)" "log(qsec)"
Construct the formula expression as a string.
exprs <- sprintf("%s ~ %s", colnames(df)[1], paste(terms, collapse = "+"));
exprs;
[1] "mpg ~ drat+I(drat^2)+log(drat)+wt+I(wt^2)+log(wt)+qsec+I(qsec^2)+log(qsec)"
Run the linear model.
model <- lm(as.formula(exprs), data = df);
Re-fit model using all combinations of predictor variables.
require(olsrr);
k <- ols_all_subset(model);
k;
# # A tibble: 511 x 6
# Index N Predictors `R-Square` `Adj. R-Square` `Mallow's Cp`
# <int> <int> <chr> <chr> <chr> <chr>
# 1 1 1 log(wt) 0.81015 0.80382 9.73408
# 2 2 1 wt 0.75283 0.74459 21.12525
# 3 3 1 I(wt^2) 0.64232 0.63040 43.08976
# 4 4 1 I(drat^2) 0.46694 0.44917 77.94664
# 5 5 1 drat 0.46400 0.44613 78.53263
# 6 6 1 log(drat) 0.45406 0.43587 80.50653
# 7 7 1 log(qsec) 0.17774 0.15033 135.42760
# 8 8 1 qsec 0.17530 0.14781 135.91242
# 9 9 1 I(qsec^2) 0.17005 0.14239 136.95476
#10 10 2 log(wt) log(qsec) 0.87935 0.87103 -2.02118
## ... with 501 more rows
A few comments:
This very quickly gets very computationally intensive.
The y ~ . formula syntax is very concise, but I haven't found a way to include e.g. quadratic terms. On one hand, y ~ . + (.)^2 works for including all interaction terms; on the other hand, y ~ . + I(.^2) does not work for quadratic terms. That's why I think building terms manually is the way to go.
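A quick way to see the difference is to inspect the term labels each formula expands to (using the df built above):
attr(terms(mpg ~ (.)^2, data = df), "term.labels")
#[1] "drat"      "wt"        "qsec"      "drat:wt"   "drat:qsec" "wt:qsec"
# whereas mpg ~ . + I(.^2) is rejected: the dot is only expanded at the top
# level of the formula, so quadratic terms must be built by hand as above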

Fit series of linear models to separate dependent variables for a different grouping variable

Hi: I have six continuous dependent variables and one independent variable for three countries. I would like to see what the coefficient is for y1 to y6 ~ x1 for each country. Is there a way to do this neatly with dplyr and broom? I know dplyr fairly well but am new to broom.
#one random independent variable
x1<-rnorm(100, mean=5, sd=1)
#one random dependent variable
y1<-rnorm(100, mean=2, sd=2)
#two random dependent variables, in reality I have six
y2<-rnorm(100, mean=3, sd=1)
#Grouping variable.
country<-sample(seq(1,3,1), size=100, replace=T)
#data frame
df<-data.frame(x1, y1, y2, country)
#I would like to see what the coefficients are for y1 ~ x1
#and then y2 ~ x1 for country 1, country 2, country 3, etc.
library(dplyr)
#Fit one model for each of three countries
test <- df %>%
  group_by(country) %>%
  do(mod.y1 = lm(y1 ~ x1, data = .))
#print results
test$mod.y1
You can use a combination of gather from tidyr and tidy from broom. First, instead of doing one fit for each country, do one fit for each combination of y1/y2 and country:
library(tidyr)
library(broom)
fits <- df %>%
  gather(variable, value, y1, y2) %>%
  group_by(country, variable) %>%
  do(mod = lm(value ~ x1, .))
Then you can tidy them (with broom) and filter out the intercept term:
td <- tidy(fits, mod) %>%
  filter(term != "(Intercept)")
This gives you a data frame td that looks like:
Source: local data frame [6 x 7]
Groups: country, variable [6]

  country variable  term     estimate std.error   statistic   p.value
    (dbl)    (chr) (chr)        (dbl)     (dbl)       (dbl)     (dbl)
1       1       y1    x1  0.106140467 0.3835857  0.27670599 0.7835458
2       1       y2    x1 -0.004725751 0.1837192 -0.02572268 0.9796168
3       2       y1    x1 -0.193700062 0.4690913 -0.41292614 0.6826979
4       2       y2    x1  0.094083592 0.2024151  0.46480518 0.6455421
5       3       y1    x1 -0.223523980 0.3820297 -0.58509584 0.5631692
6       3       y2    x1 -0.029720338 0.2116219 -0.14044074 0.8893172
Your estimate column is the estimated coefficients.
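From there, reshaping the slopes into one row per country is short with tidyr's spread; a small sketch:
td %>%
  ungroup() %>%
  select(country, variable, estimate) %>%
  spread(variable, estimate)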

Get the p-values from the lm function for grouped data

I am trying to fit a model for each segment in my data using the lm() function in conjunction with the plyr package because my data is grouped by a key.
I've managed to run the model and get the coefficients along with the R^2 & adj r-squared but I am struggling with the p-values.
library("plyr")
#Sample data
test_data <- data.frame(key = c("a","a","a","a","a","b","b","b","b","b"),
                        y = c(100,180,120,60,140,200,220,240,260,280),
                        x1 = c(50,60,79,85,90,133,140,120,160,170),
                        x2 = c(20,18,47,16,15,25,30,25,20,15))
#model
model_1 <- dlply(test_data, .(key),
                 function(test_data) lm(y ~ x1 + x2, data = test_data))
#coefficients
ldply(model_1, coef)
#r-squared (use summary(x)$adj.r.squared for the adjusted version)
ldply(model_1, function(x) summary(x)$r.squared)
I've tried this, which gets me the key and the p-values, but it doesn't have the names of the variables, which I need in order to merge the output with the coefficients from the model later.
#p-values but missing the variable names
ldply(model_1, function(x) summary(x)$coefficients)[,c(1,5)]
I've tried to fit the models using do() from dplyr and then tidy() from the broom package, and this works fine with a small data set because it actually returns everything I need, but my actual data contains over 1,000 different segments and RStudio ends up crashing.
I'm using the "dplyr" package to formatting the output. In the function that you use inside the "dlply" function you should use summary() to the lm(), so when you call "coef" it will also include the p.values.
test_data <- data.frame(key = c("a","a","a","a","a","b","b","b","b","b"),
                        y = c(100,180,120,60,140,200,220,240,260,280),
                        x1 = c(50,60,79,85,90,133,140,120,160,170),
                        x2 = c(20,18,47,16,15,25,30,25,20,15))
model <- by(test_data, test_data$key, function(x) summary(lm(y ~ x1 + x2, x)))
R2 <- t(data.frame(lapply(model, function(x) x$adj.r.squared)))
colnames(R2) <- "R2_adj"
R2
R2_adj
a -0.8939647
b 0.4292186
Co <- as.data.frame(t(data.frame(lapply(model, function(x) x$coef))))
colnames(Co) <- c("intercept", "x1", "x2")
library(dplyr)
Co %>%
  mutate(key = substr(rownames(Co), 1, 1),
         variable = substr(rownames(Co), 3, 12)) %>%
  select(key, variable, intercept, x1, x2)
key variable intercept x1 x2
1 a Estimate 162.1822438 -0.6037364 0.07628315
2 a Std..Error 141.3436897 1.8054132 2.29385395
3 a t.value 1.1474318 -0.3344035 0.03325545
4 a Pr...t.. 0.3699423 0.7698867 0.97649134
5 b Estimate 271.0532276 0.3624009 -3.62853907
6 b Std..Error 196.2769562 0.9166979 3.25911570
7 b t.value 1.3809733 0.3953330 -1.11335080
8 b Pr...t.. 0.3013515 0.7307786 0.38142882
No need for plyr I think, sapply will do just fine.
sapply(model_1, function(x) summary(x)$coefficients[, 4])
a b
(Intercept) 0.3699423 0.3013515
x1 0.7698867 0.7307786
x2 0.9764913 0.3814288
And t() will get those in the same configuration as your estimates.
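For example, transposed it lines up one row per key:
t(sapply(model_1, function(x) summary(x)$coefficients[, 4]))
#  (Intercept)        x1        x2
#a   0.3699423 0.7698867 0.9764913
#b   0.3013515 0.7307786 0.3814288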
By the way, if you end up going with tidy and dplyr::do after all, you may want to look at the multidplyr package, which can partition the work across multiple cores.
