A loop to create multiple data frames from a population data frame - r

Suppose I have a data frame called pop, and I wish to split this data frame by a categorical variable called replicate. This replicate consists of 110 categories. I wish to perform an analysis on each resulting data frame, and the output of each analysis must then be combined into a new data frame. In other words, for replicate i I wish to create data frame i, perform a logistic regression on it and save beta 0 for i. All the beta 0 values will then be combined into a table with the beta 0 for replicates 1-110.
I know that's a mouthful, but thanks in advance.

Since you didn't give any sample data, I will use mtcars. You can use split to split a data.frame on a categorical value. Combining this with map and tidy from the purrr and broom packages, you can create a data.frame with all the betas in one go.
So what happens is: (1) split the data.frame, (2) run the regression model on each piece, (3) tidy each model to get the coefficients out and bind them into one data.frame.
You will need to adjust this to your data.frame and replicate variable. broom can handle logistic regression, so everything should work out.
library(purrr)
library(broom)
my_lms <- mtcars %>%
split(.$cyl) %>%
map(~ lm(mpg ~ wt, data = .x)) %>%
map_dfr(~ tidy(.))
my_lms
term estimate std.error statistic p.value
1 (Intercept) 39.571196 4.3465820 9.103980 7.771511e-06
2 wt -5.647025 1.8501185 -3.052251 1.374278e-02
3 (Intercept) 28.408845 4.1843688 6.789278 1.054844e-03
4 wt -2.780106 1.3349173 -2.082605 9.175766e-02
5 (Intercept) 23.868029 3.0054619 7.941551 4.052705e-06
6 wt -2.192438 0.7392393 -2.965803 1.179281e-02
EDIT
my_lms <- lapply(split(mtcars, mtcars$cyl), function(x) lm(mpg ~ wt, data = x))
my_coefs <- as.data.frame(sapply(my_lms, coef))
my_coefs
4 6 8
(Intercept) 39.571196 28.408845 23.868029
wt -5.647025 -2.780106 -2.192438
# Or transpose the coefficients if you want them as columns.
t(my_coefs)
(Intercept) wt
4 39.57120 -5.647025
6 28.40884 -2.780106
8 23.86803 -2.192438
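For the logistic regression in the original question, the same split/apply/combine pattern should carry over by swapping lm() for glm(). Below is a minimal base R sketch under stated assumptions: the pop data frame, its columns replicate, x and y, and the use of only five replicates are all made up here as stand-ins for the real 110-replicate data.

```r
# Simulated stand-in for `pop` (assumption: the real data has a binary
# outcome `y`, one predictor `x`, and a grouping column `replicate`).
set.seed(1)
pop <- data.frame(
  replicate = rep(1:5, each = 200),
  x = rnorm(1000)
)
pop$y <- rbinom(1000, size = 1, prob = plogis(-0.5 + pop$x))

# 1. split by replicate, 2. fit one logistic regression per piece
fits <- lapply(split(pop, pop$replicate),
               function(d) glm(y ~ x, data = d, family = binomial))

# 3. pull the intercept (beta 0) out of every fit and combine into one table
beta0 <- data.frame(
  replicate = names(fits),
  beta0 = vapply(fits, function(f) coef(f)[["(Intercept)"]], numeric(1)),
  row.names = NULL
)
beta0
```

With the real data, only the formula and the column passed to split() should need to change.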

Related

Obtain P-Value of Fixed Value in Anova Table of many Linear Regressions with Broom Package

In the multiple linear regression lm(FE_FCE2 ~ Trial + .x, data = DF_FCE3) there is one fixed variable (Trial) and many x variables. I am analysing each x variable against FE_FCE2 with Trial as a fixed effect. I then use the broom package to run the many regressions and collect the results in one table. I have obtained the regression results, but I cannot get the data from the ANOVA table into broom with the map function.
Is it possible? And if yes, how?
I have used the following code to obtain the regression results:
DF_FCE3 %>%
select(-FE_FCE2, -Trial) %>% # exclude outcome, leave only predictors
map( ~lm(FE_FCE2 ~ Trial + .x, data = DF_FCE3)) %>%
map(summary) %>%
map_df(glance) %>%
round(3) -> rsme
However, I would like to obtain the p-value (4.26e-08 ***) from the ANOVA table for Trial, to see if Trial had a significant influence on the x variable.
$x1
Analysis of Variance Table
Response: FE_FCE2
Df Sum Sq Mean Sq F value Pr(>F)
Trial 3 0.84601 0.282002 15.0653 4.26e-08 ***
.x 1 0.00716 0.007161 0.3826 0.5377
Residuals 95 1.77827 0.018719
---
Is it possible to use the broom package with map function to obtain a table which contains all the many p values of the anova regressions?
Like this (using mpg)?
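This returns a data frame with a single row of p-values and one column per predictor, i.e. every original column except the outcome and the fixed term (hwy and cyl in this example, FE_FCE2 and Trial in your case). Each value is the ANOVA p-value of the fixed term in the model that includes that predictor.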
mpg %>%
select(-hwy, -cyl) %>% # exclude outcome, leave only predictors
map( ~lm(hwy ~ cyl + .x, data = mpg)) %>%
map(anova) %>%
map(broom::tidy) %>%
map_df(~.$p.value[1])
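The same table can probably also be built without broom. Here is a rough base R sketch; it assumes mtcars as stand-in data (to avoid needing ggplot2's mpg), with mpg as the outcome and cyl as the fixed term. Those roles are hypothetical and only illustrate the pattern.

```r
outcome <- "mpg"   # stand-in for FE_FCE2
fixed   <- "cyl"   # stand-in for Trial
predictors <- setdiff(names(mtcars), c(outcome, fixed))

# For each predictor, fit outcome ~ fixed + predictor and read the
# Pr(>F) entry of the fixed term from the (sequential) ANOVA table.
anova_p <- vapply(predictors, function(v) {
  fit <- lm(reformulate(c(fixed, v), response = outcome), data = mtcars)
  anova(fit)[fixed, "Pr(>F)"]
}, numeric(1))

p_table <- data.frame(predictor = predictors, p_fixed = anova_p,
                      row.names = NULL)
p_table
```

Because anova() on a single lm fit uses sequential sums of squares, the fixed term is listed first in the formula here, matching the order used in the question.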

Save R-squared from lm summary as a dataframe

I want to save the result of a lm model into a dataframe. I generated an empty dataframe (Startframe), where I want to save the results.
My dataframe containing the data is called testdata in this case. It contains the Date in the first column and then several Stations in the rest of the columns.
So far this code is working to get the Estimate, Std. Error, t value and Pr(>|t|).
for(i in 2:ncol(testdata)) {
x <- testdata[,1]
y <- testdata[,i]
mod <- lm(y ~ x)
summary(mod)
Startframe[i,] <- c(i,
summary(mod)[['coefficients']]['(Intercept)','Estimate'],
summary(mod)[['coefficients']]['x','Estimate'],
summary(mod)[['coefficients']]['x','Std. Error'],
summary(mod)[['coefficients']]['x','t value'],
summary(mod)[['coefficients']]['x','Pr(>|t|)'])
}
But how can I also extract the r.squared?
I tried to add summary(mod)[['r.squared']] to the list, but it gives me the wrong numbers.
I know str(summary(mod)) gives me an overview, but I can't figure out how to add it into my loop.
Thanks for your help.
A nice way to work with the same model on different datasets is the tidyverse approach with the broom package.
In this example I'm using the diamonds dataset to test how carat and depth affect the diamonds' price in different diamond cuts.
require(tidyverse)
require(broom)
diamonds %>%
nest(-cut) %>%
mutate(model = purrr::map(data, function(x) {
lm(price ~ carat + depth, data = x)}),
values = purrr::map(model, glance),
r.squared = purrr::map_dbl(values, "r.squared"),
pvalue = purrr::map_dbl(values, "p.value")) %>%
select(-data, -model, -values)
cut r.squared pvalue
<ord> <dbl> <dbl>
1 Ideal 0.867 0
2 Premium 0.856 0
3 Good 0.851 0
4 Very Good 0.859 0
5 Fair 0.746 0
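If you would rather keep the original for loop, one possible shape is sketched below. It assumes a made-up testdata with a numeric Date column and two hypothetical stations, computes the summary once per station, and reads r.squared directly from it. Collecting one-row data frames in a list sidesteps the question of how Startframe was pre-allocated.

```r
# Simulated stand-in for `testdata` (assumption: first column is the
# date, stored as a number here; remaining columns are stations).
set.seed(42)
testdata <- data.frame(
  Date = 1:50,
  StationA = rnorm(50),
  StationB = 0.1 * (1:50) + rnorm(50)
)

rows <- vector("list", ncol(testdata) - 1)
for (i in 2:ncol(testdata)) {
  x <- testdata[[1]]
  y <- testdata[[i]]
  s <- summary(lm(y ~ x))          # compute the summary once per station
  cf <- s$coefficients
  rows[[i - 1]] <- data.frame(
    station   = names(testdata)[i],
    intercept = cf["(Intercept)", "Estimate"],
    slope     = cf["x", "Estimate"],
    std.error = cf["x", "Std. Error"],
    t.value   = cf["x", "t value"],
    p.value   = cf["x", "Pr(>|t|)"],
    r.squared = s$r.squared        # plain list element of summary.lm
  )
}
Startframe <- do.call(rbind, rows)
Startframe
```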

Hundreds of linear regressions that run by group in R [closed]

I have a table with 3,000+ rows and 10+ variables. I am trying to run a linear regression using one variable as the predictor and another as the response for 300 different groups. I need the slope, p-value, and r-squared for each of these regressions. To do each regression individually and record the summary variables would take hours if not days.
I have used the following package to get the intercept and slope for each group, but I do not know how to also get the corresponding p-value and r-squared for each group:
library(lme4)
groupreg<-lmList(logpop ~ avgp | id, data=data)
groupreg
I achieved a list sample below, where "Adams #" is the id value. NAs exist because not all groups have multiple points to plot and compare:
Coefficients:
(Intercept) avgp
Adams 6 4.0073332 NA
Adams 7 6.5177389 -7.342443e+00
Adams 8 4.7449321 NA
Adams 9 NA NA
This table does not include any significance statistics, however. I still need the p-value and r-squared statistic. If there is a code to do it all in one go for all group values, or a code to just pull the remaining values, it would be helpful.
Is there also a way to exponentiate the slope output for all groups? My outcome was log-transformed.
Thank you all!!
I think the easiest answer is still missing. You can use a combination of nesting and mapping. I'll show you how it works for linear regression. I think you'll be able to apply the same principle to models from the lme4 package.
Let's create a toy data set, where we've measured the IQ score for three different groups at two different points in time.
library(tidyverse)
library(broom)
df <- tibble(
id = seq_len(90),
IQ = rnorm(90, 100, 15),
group = rep(c("A", "B", "C"), each = 30),
time = rep(c("T1", "T2"), 45)
)
If we want to build a regression model for each group, investigating the relation between the IQ score and the point of time, we only need five lines of code.
df %>%
nest(-group) %>%
mutate(fit = map(data, ~ lm(IQ ~ time, data = .)),
results = map(fit, glance)) %>%
unnest(results) %>%
select(group, r.squared, p.value)
Which will return
# A tibble: 3 x 3
group r.squared p.value
<chr> <dbl> <dbl>
1 A 0.0141 0.532
2 B 0.0681 0.164
3 C 0.00432 0.730
where nest(-group) creates tibbles within your tibble for each group, containing the corresponding variables of id, IQ and time. Then you add a new column fit with mutate() where you apply a regression model for each group and a new column containing the results, which we unnest() shortly after to access the values glance() returned properly. In the last step we select() the three values of interest.
To get the slope you need to call tidy() in addition. Maybe it's possible to shorten the code somehow, but one solution would be
df %>%
nest(-group) %>%
mutate(fit = map(data, ~ lm(IQ ~ time, data = .)),
results1 = map(fit, glance),
results2 = map(fit, tidy)) %>%
unnest(results1) %>%
unnest(results2) %>%
select(group, term, estimate, r.squared, p.value) %>%
mutate(estimate = exp(estimate))
To exponentiate the slope, you can just add another mutate() statement. Finally it returns
# A tibble: 6 x 5
group term estimate r.squared p.value
<chr> <chr> <dbl> <dbl> <dbl>
1 A (Intercept) 3.34e+46 0.0141 0.532
2 A timeT2 3.31e- 2 0.0141 0.532
3 B (Intercept) 1.17e+47 0.0681 0.164
4 B timeT2 1.34e- 3 0.0681 0.164
5 C (Intercept) 8.68e+43 0.00432 0.730
6 C timeT2 1.25e- 1 0.00432 0.730
Note that the estimates are exponentiated already. Without the exponentiation you can double check the slope and p value with base R calling
summary(lm(IQ ~ time, data = filter(df, group == "A")))
If you work with more complex models (lme4), there is a package called lmerTest, which offers wrapper functions for lme4 that return p-values (at least for the mixed models I have worked with).
A word of warning about using glance() on lme4 models: the maintainers of the broom package are trying a new approach in which they hand the summary statistics over to the developers responsible for each modelling package.
If I am understanding your question correctly, you want to run multiple regressions over lots of groups. Here is an example of how to do so with the mtcars data.
library(dplyr)
mtcars %>% group_by(cyl) %>%
summarise_at(vars(disp:wt), funs(
r.sqr = summary(lm(mpg~.))$r.squared,
intercept = summary(lm(mpg~.))$coefficients[[1]],
slope = summary(lm(mpg~.))$coefficients[[2]],
p.value = summary(lm(mpg~.))$coefficients[[8]]
))
This will run a regression per group per variable and extract the info you asked for. If your formula is always the same, you could simplify as follows.
mtcars %>% group_by(cyl) %>%
summarise(
r.sqr = summary(lm(mpg~wt))$r.squared,
intercept = summary(lm(mpg~wt))$coefficients[[1]],
slope = summary(lm(mpg~wt))$coefficients[[2]],
p.value = summary(lm(mpg~wt))$coefficients[[8]]
)
This actually runs the regression four times per group (once per value of interest). If that takes too long for your real data, you could try this:
df <- mtcars %>% group_by(cyl) %>% summarise(model = list(summary(lm(mpg~wt))))
which simply runs the model once per group, after which you extract the info you want. The problem is that extracting values this way can be a pain:
df$model[[1]]$coefficients[[1]]
[1] 39.5712
While the code given by AndS will work, it runs the lm function four times for each group, which makes it a bit inefficient. You can use the following instead. I have tried to break it into simple steps:
Assuming your data frame(df) has three variables: "Group", "Dep", "Indep":
# Get the unique list of groups
groups <- unique(df$Group)
# Create a list to collect the summary of each model
model_summaries <- list()
# Run one model per group
for (i in seq_along(groups)) {
model <- lm(Dep ~ Indep, data = df[df$Group == groups[i], c("Dep", "Indep")])
# Use [[i]] so the whole summary object is stored as one list element
model_summaries[[i]] <- summary(model)
}
Each model summary contains, among others, the elements r.squared and coefficients (the latter holds the estimates, including the intercept, and the p-values).
Let me know if this helps.
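To get from per-group summaries to a single table of slope, p-value and r-squared, something like the following might work. It is only a sketch on simulated data: the column names Group, Dep and Indep follow the assumption above, the helper fit_one() is hypothetical, and group "D" is given a single row to mimic the groups that produced NA in the question.

```r
# Simulated stand-in data (assumption): three groups of 30 rows plus
# one group with a single observation.
set.seed(123)
df <- data.frame(
  Group = c(rep(c("A", "B", "C"), each = 30), "D"),
  Indep = rnorm(91)
)
df$Dep <- 1 + 0.5 * df$Indep + rnorm(91, sd = 0.5)

# Hypothetical helper: fit one model and return a one-row data frame.
fit_one <- function(d) {
  out <- data.frame(Group = d$Group[1], slope = NA_real_,
                    exp_slope = NA_real_, p.value = NA_real_,
                    r.squared = NA_real_)
  if (nrow(d) < 3) return(out)   # too few points for a slope p-value
  s <- summary(lm(Dep ~ Indep, data = d))   # lm runs once per group
  out$slope     <- s$coefficients["Indep", "Estimate"]
  out$exp_slope <- exp(out$slope)           # for a log-transformed outcome
  out$p.value   <- s$coefficients["Indep", "Pr(>|t|)"]
  out$r.squared <- s$r.squared
  out
}

group_stats <- do.call(rbind, lapply(split(df, df$Group), fit_one))
group_stats
```

Groups with fewer than three rows are returned as NA rather than dropped, so the result still has one row per group.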

Subset variables by significant P value

I'm trying to subset variables by significant P-values, and I attempted with the following code, but it only selects all variables instead of selecting by condition. Could anyone help me to correct the problem?
myvars <- names(summary(backward_lm)$coefficients[,4] < 0.05)
happiness_reduced <- happiness_nomis[myvars]
Thanks!
An alternative solution to Martin's great answer (in the comments section) using the broom package. Unfortunately, you haven't posted any data, so I'm using the mtcars dataset as a demo:
library(broom)
# build model
m = lm(disp ~ ., data = mtcars)
# create a dataframe from the model's output
tm = tidy(m)
# visualise dataframe of the model
# (using non scientific notation of numbers)
options(scipen = 999)
tm
# term estimate std.error statistic p.value
# 1 (Intercept) -5.8119829 228.0609389 -0.02548434 0.97990925639
# 2 mpg 1.9398052 2.5976340 0.74675849 0.46348865035
# 3 cyl 15.3889587 12.1518291 1.26639032 0.21924091701
# 4 hp 0.6649525 0.2259928 2.94236093 0.00777972543
# 5 drat 8.8116809 19.7390767 0.44640796 0.65987184728
# 6 wt 86.7111730 16.1127236 5.38153418 0.00002448671
# 7 qsec -12.9742622 8.6227190 -1.50466021 0.14730421493
# 8 vs -12.1152075 25.2579953 -0.47965832 0.63642812949
# 9 am -7.9135864 25.6183932 -0.30890253 0.76043942893
# 10 gear 5.1265224 18.0578153 0.28389494 0.77927112074
# 11 carb -30.1067073 7.5513212 -3.98694566 0.00067029676
# get variables with p value less than 0.05
tm$term[tm$p.value < 0.05]
# [1] "hp" "wt" "carb"
The main advantage is that by obtaining the model's output as a dataframe you can use variable names, and not variable positions and row names, to manipulate the data.
I'm using options(scipen = 999) to make it easier to check that filtering works (i.e. not using the scientific notation of numbers in the dataframe).
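For completeness, the original base R attempt can likely be repaired with a small change. In the sketch below (again on mtcars as a stand-in for happiness_nomis), the point is that names() applied to a logical vector returns every name, so the comparison has to be used as an index instead. Dropping "(Intercept)" is an added precaution, since it is not a column of the data.

```r
m <- lm(disp ~ ., data = mtcars)          # stand-in for backward_lm
pvals <- summary(m)$coefficients[, 4]     # column 4 holds Pr(>|t|)

# names(pvals < 0.05) would return ALL names; index with the condition
sig <- names(pvals)[pvals < 0.05]
sig <- setdiff(sig, "(Intercept)")        # the intercept is not a data column

mtcars_reduced <- mtcars[sig]             # keep only the significant predictors
sig
```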

R adding regression coefficients to data frame

I have a list of dataframes that contains many subsets of data (470ish). I am trying to run a regression on each of them and add the regression coefficients to a dataframe. The dataframe will contain the coefficients for all dependent variables on each subgroup. I tried iterating with a for loop but obviously that is not the right way. I think the solution has something to do with lapply?
for (i in ListOfTraining){
lm(JOB_VOLUME ~ FEB+MAR+APR+MAY+JUN+JUL+AUG+SEP+OCT+NOV+DEC data=ListOfTraining[[i]])
}
Thanks for any advice!
The function tidy from package broom handles this nicely.
library(dplyr) # bind_rows is more efficient than do.call(rbind, ...)
library(broom) # put statistics into data.frame
bind_rows(lapply(ListOfTraining, function(dat)
tidy(lm(JOB_VOLUME ~ FEB+MAR+APR+MAY+JUN+JUL+AUG+SEP+OCT+NOV+DEC, data=dat))))
Example
dataList <- split(mtcars, mtcars$cyl) # list of data.frames by number of cylinders
lapply(dataList, function(dat) tidy(lm(mpg ~ disp + hp, data=dat))) %>% # fit models
bind_rows() %>% # combine into one data.frame
mutate(model=rep(1:length(dataList), each=3)) # add a model ID column
# term estimate std.error statistic p.value model
# 1 (Intercept) 43.040057552 4.235724713 10.16120274 7.531962e-06 1
# 2 disp -0.119536016 0.036945788 -3.23544366 1.195900e-02 1
# 3 hp -0.046091563 0.047423668 -0.97191054 3.595602e-01 1
# 4 (Intercept) 20.151209478 6.938235241 2.90437104 4.392508e-02 2
# 5 disp 0.001796527 0.020195109 0.08895852 9.333909e-01 2
# 6 hp -0.006032441 0.034597750 -0.17435935 8.700522e-01 2
# 7 (Intercept) 24.044775630 4.045729006 5.94324919 9.686231e-05 3
# 8 disp -0.018627566 0.009456903 -1.96973225 7.456584e-02 3
# 9 hp -0.011315585 0.012572498 -0.90002676 3.873854e-01 3
Alternatively, you could bind the data.frames beforehand, assuming they have the same columns. Then, fit models using lmList from nlme package.
## Combine list of data.frames into one data.frame with a factor variable
lengths <- sapply(dataList, nrow) # in case data.frames have different num. rows
dat <- dataList %>% bind_rows() %>%
mutate(group=rep(1:length(dataList), times=lengths)) # group id column
library(nlme) # lmList()
models <- lmList(mpg ~ disp + hp | group, data=dat) # make models, grouped by group
coef(models) # one row of coefficients per group
# (Intercept) disp hp
# 1 43.04006 -0.119536016 -0.046091563
# 2 20.15121 0.001796527 -0.006032441
# 3 24.04478 -0.018627566 -0.011315585
You can solve this using the for loop, if you prefer. Your problem is that the results aren't being saved to an object as the loop progresses. You can see the below for an example using the built-in mtcars dataframe.
(This first example is revised based on OP's request for an example of how to also extract the R squared value.)
ListOfTraining <- list(mtcars, mtcars)
results <- list()
for (i in seq_along(ListOfTraining)) {
lm_obj <- lm(disp ~ qsec, data = ListOfTraining[[i]])
tmp <- c(lm_obj$coefficients, summary(lm_obj)$r.squared)
names(tmp)[length(tmp)] <- "r.squared"
results[[i]] <- tmp
}
results <- do.call(rbind, results)
results
You can also rewrite the for loop using lapply as demoed below.
ListOfTraining <- list(mtcars, mtcars)
results <- list()
results <- lapply(ListOfTraining, function(x) {
lm(disp ~ qsec, data = x)$coefficients
})
results <- do.call(rbind, results)
results
Finally, you can use the plyr package's ldply function which will convert the list applied outputs into a dataframe automatically (if possible).
ListOfTraining <- list(mtcars, mtcars)
results <- plyr::ldply(ListOfTraining, function(x) {
lm(disp ~ qsec, data = x)$coefficients
})
results
Your current code runs the regression, but does not do anything with the results (inside of a loop they are not even autoprinted), so they are just discarded. You need to have some structure to save the results into.
The following code will create a matrix of coefficients (assuming that all the regressions run without error and the number of final coefficients is the same):
my.coef <- sapply( ListOfTraining, function(dat) {
coef(lm( JOB_VOLUME ~ FEB+MAR+APR+MAY+JUN+JUL+AUG+SEP+OCT+NOV+DEC,
data=dat) )
})
The matrix can then be converted to a data frame (you could also use lapply and convert to a data frame, but I think the sapply option is probably a little simpler).
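The conversion step could look like this. It is a small sketch that uses mtcars split by cyl as a stand-in for ListOfTraining; sapply() puts the groups in columns, so the matrix is transposed first to get one row per group.

```r
# Stand-in for ListOfTraining: mtcars split by number of cylinders
dataList <- split(mtcars, mtcars$cyl)

# Matrix with one column per group and one row per coefficient
my.coef <- sapply(dataList, function(dat) {
  coef(lm(mpg ~ disp + hp, data = dat))
})

# Transpose so that each group becomes a row, then convert
coef_df <- as.data.frame(t(my.coef))
coef_df$group <- rownames(coef_df)   # keep the group label as a column
coef_df
```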

Resources