Data frame couldn´t show - r

I try running the following code, but get error ar shown in the pictures below. Im quite new to R so dont know if its information to the case, but the first column in my data frame called "data" is dates. I get as.Dates.numeric(value) "origin" must be applied, my intuition says it got something to do with the date column, but then again, im a newbie. Just in case, the date column is not supposed to be a part of coef.vec.
v1 <- 2:7
coef.vec <- data.frame(NULL) # create object to keep results
for (i in seq_along(v1)) {
m <- summary(lm(data[,v1[i]] ~ data[,8])) # run model
coef.vec[i, 1] <- names(data)[v1[i]] # print variable name
coef.vec[i, 2] <- m$coefficients[1,1] # intercept
coef.vec[i, 3] <- m$coefficients[2,1] # coefficient
coef.vec[i, 4] <- mean(data[[i]]) # means of variables
}
names(coef.vec) <- c("y.variable", "intercept", "coef.x","variable.mean")
error1
error2

Try this approach using lapply for column 2 to 7 of your data.
coef.vec <- do.call(rbind, lapply(names(data)[2:7], function(x) {
m <- summary(lm(data[[x]] ~ data[[8]]))
data.frame(y.variable = x,
intercept = m$coefficients[1,1],
coef.x = m$coefficients[2,1],
variable.mean = mean(data[[x]]))
}))

We can construct the formula with reformulate, apply the lm, get the summary output with tidy from broom and create a single dataset
library(dplyr)
library(purrr)
library(broom)
map_dfr(names(data)[2:7], ~
tidy(lm(reformulate(names(data)[8], response = .x), data = data)))
Or this can be done in a single step without any loop
tidy(lm(cbind(iris[,1], iris[,2]) ~ Species, iris))
Or
tidy(lm(as.matrix(iris[1:2]) ~ Species, iris))
# A tibble: 6 x 6
# response term estimate std.error statistic p.value
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#1 Sepal.Length (Intercept) 5.01 0.0728 68.8 1.13e-113
#2 Sepal.Length Speciesversicolor 0.93 0.103 9.03 8.77e- 16
#3 Sepal.Length Speciesvirginica 1.58 0.103 15.4 2.21e- 32
#4 Sepal.Width (Intercept) 3.43 0.0480 71.4 5.71e-116
#5 Sepal.Width Speciesversicolor -0.658 0.0679 -9.69 1.83e- 17
#6 Sepal.Width Speciesvirginica -0.454 0.0679 -6.68 4.54e- 10
and check the output from the loop
map_dfr(names(iris)[1:2], ~ tidy(lm(reformulate('Species', response = .x), data = iris)))
# A tibble: 6 x 5
# term estimate std.error statistic p.value
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 (Intercept) 5.01 0.0728 68.8 1.13e-113
#2 Speciesversicolor 0.93 0.103 9.03 8.77e- 16
#3 Speciesvirginica 1.58 0.103 15.4 2.21e- 32
#4 (Intercept) 3.43 0.0480 71.4 5.71e-116
#5 Speciesversicolor -0.658 0.0679 -9.69 1.83e- 17
#6 Speciesvirginica -0.454 0.0679 -6.68 4.54e- 10

Related

R loop over linear regression

I have looked over the forum but couldn't find what I am looking for.
I want to run a simple linear regression a couple of times. Each time using a different column as my independent variable, the dependent variable stays the same. After running it I want to be able to extract the R squared from each of the regressions. My thought process was to use a simple for loop. However, I cannot make it work.
Assume I work with the following data:
num value person1 person2 person3
0 1 229 29 81 0
1 2 203 17 75 0
2 3 244 62 0 55
and that I want to run the regression on the value using three variables: person1, person2 and person3. Note that this is a minimal working example but I hope to generalize the idea.
And so my initial attempt was to:
column <- names(df)[-2]
for(i in 3:5){
temp <- df[,c("value", column[i])]
lm.test <- lm(value ~ ., data = temp)
i + 1
}
However, when I run summary(lm.test) I only get a summary of the last regression, i.e. lm(value ~ person3) which I think makes sense but when trying to rewrite it as: lm.test[i] <- lm(value ~ ., data = temp) I get the following error:
debug at #3: temp <- df[,c("value", column[i])]
suggesting that there's something wrong with line 3?
If possible I'd like to be able to capture the summary for each regression but what I am really after is the R squared for each one of the regressions.
You can create formula in a loop and then run the lm. For instance, if I want to run regression on mtcars for regressing mpg on each of cyl, wt, hp, I can use the following:
vars <- c("cyl", "wt", "hp")
lm_results <- lapply(vars, function(col){
lm_formula <- as.formula(paste0("mpg ~ ", col))
lm(lm_formula, data = mtcars)
})
You can then again iterate over lm_results to get the r.squared:
lapply(lm_results, function(x) summary(x)$r.squared)
Here’s an approach using broom::glance() and purrr::map_dfr() to collect model summary stats into a tidy tibble:
library(broom)
library(purrr)
lm.test <- map_dfr(
set_names(names(df)[-2]),
~ glance(lm(
as.formula(paste("value ~", .x)),
data = df
)),
.id = "predictor"
)
Result:
# A tibble: 4 x 13
predictor r.squared adj.r.squared sigma statistic p.value df logLik AIC
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 num 0.131 -0.739 27.4 0.150 0.765 1 -12.5 31.1
2 person1 0.836 0.672 11.9 5.10 0.265 1 -10.0 26.1
3 person2 0.542 0.0831 19.9 1.18 0.474 1 -11.6 29.2
4 person3 0.607 0.215 18.4 1.55 0.431 1 -11.3 28.7
# ... with 4 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>,
# nobs <int>
NB, you can capture model coefficients with a similar approach using broom::tidy() instead of glance().

How to write a function that will run multiple regression models of the same type with different dependent variables and then store them as lists?

I am trying to write a function that will run multiple regressions and then store the outputs in a vector. What I want is for the function to pick the dependent variables from a list that I will provide, and then run the regressions on the same right hand-side variables. Not sure how to go about doing this. Any hints will be appreciated.
my_data <- data.frame(x1=(1:10) + rnorm(10, 3, 1.5), x2=25/3 + rnorm(10, 0, 1),
dep.var1=seq(5, 28, 2.5), dep.var2=seq(100, -20, -12.5),
dep.var3=seq(1, 25, 2.5))
## The following is a list that tells the function
dep.var <- list(dep.var1=my_data$dep.var1, dep.var2=my_data$dep.var2)
## which dependent variables to use from my_data
all_models <- function(dep.var){lm(dep.var ~ x1 + x2, data=my_data)}
model <- sapply(dep.var, all_models) ## The "sapply" here tells the function to
## take the dependent variables from the list dep.var.
I want the "model" list to have two objects: model1 with dep.var1 and model2 with dep.var2. Then as required, I will use summary(model#) to see the regression output.
I know that this in theory works when a vector is used (i.e., p):
p <- seq(0.25, 0.95, 0.05)
s <- function(p) {1 - pnorm(35, p*1*44, sqrt(44)*sqrt(p*(1 - p)))}
f <- sapply(p, s)
But I can't get the whole thing to work as required for my regression models. It works somewhat because you can run and check "model" and it will show you the two regression outputs - but it is horrible. And the "model" does not show the regression specification, i.e., dep.var1 ~ x1 + x2.
Consider reformulate to dynamically change model formulas using character values for lm calls:
# VECTOR OF COLUMN NAMES (NOT VALUES)
dep.vars <- c("dep.var1", "dep.var2")
# USER-DEFINED METHOD TO PROCESS DIFFERENT DEP VAR
run_model <- function(dep.var) {
fml <- reformulate(c("x1", "x2"), dep.var)
lm(fml, data=data)
}
# NAMED LIST OF MODELS
all_models <- sapply(dep.vars, run_model, simplify = FALSE)
# OUTPUT RESULTS
all_models$dep.var1
all_models$dep.var2
...
From there, you can run further extractions or processes across model objects:
# NAMED LIST OF MODEL SUMMARIES
all_summaries <- lapply(all_models, summary)
all_summaries$dep.var1
all_summaries$dep.var2
...
# NAMED LIST OF MODEL COEFFICIENTS
all_coefficients <- lapply(all_models, `[`, "coefficients")
all_coefficients$dep.var1
all_coefficients$dep.var2
...
You could sapply over the names of the dependent vars which you could nicely identify with grep. In lm use reformulate to build the formula.
sapply(grep('^dep', names(my_data), value=TRUE), \(x)
lm(reformulate(c('x1', 'x2'), x), my_data))
# dep.var1 dep.var2 dep.var3
# coefficients numeric,3 numeric,3 numeric,3
# residuals numeric,10 numeric,10 numeric,10
# effects numeric,10 numeric,10 numeric,10
# rank 3 3 3
# fitted.values numeric,10 numeric,10 numeric,10
# assign integer,3 integer,3 integer,3
# qr qr,5 qr,5 qr,5
# df.residual 7 7 7
# xlevels list,0 list,0 list,0
# call expression expression expression
# terms dep.var1 ~ x1 + x2 dep.var2 ~ x1 + x2 dep.var3 ~ x1 + x2
# model data.frame,3 data.frame,3 data.frame,3
The dep.var* appear nicely in the result.
However, you probably want to use lapply and pipe it into setNames() to get the list elements named. Instead of grep you may of course define the dependent variables manually. To get a clean formular call stored, we use a trick once #g-grothendieck taught me using do.call.
dv <- as.list(grep('^dep', names(my_data), value=TRUE)[1:2])
res <- lapply(dv, \(x) {
f <- reformulate(c('x1', 'x2'), x)
do.call('lm', list(f, quote(my_data)))
}) |>
setNames(dv)
res
# $dep.var1
#
# Call:
# lm(formula = dep.var1 ~ x1 + x2, data = my_data)
#
# Coefficients:
# (Intercept) x1 x2
# -4.7450 2.3398 0.2747
#
#
# $dep.var2
#
# Call:
# lm(formula = dep.var2 ~ x1 + x2, data = my_data)
#
# Coefficients:
# (Intercept) x1 x2
# 148.725 -11.699 -1.373
This allows you to get the summary() of the objects, which probably is what you want.
summary(res$dep.var1)
# Call:
# lm(formula = dep.var1 ~ x1 + x2, data = my_data)
#
# Residuals:
# Min 1Q Median 3Q Max
# -2.8830 -1.8345 -0.2326 1.4335 4.2452
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -4.7450 7.2884 -0.651 0.536
# x1 2.3398 0.2836 8.251 7.48e-05 ***
# x2 0.2747 0.7526 0.365 0.726
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 2.55 on 7 degrees of freedom
# Multiple R-squared: 0.9117, Adjusted R-squared: 0.8865
# F-statistic: 36.14 on 2 and 7 DF, p-value: 0.0002046
Finally you could wrap it in a function
calc_models <- \(dv) {
lapply(dv, \(x) {
f <- reformulate(c('x1', 'x2'), x)
do.call('lm', list(f, quote(my_data)))
}) |>
setNames(dv)
}
calc_models(list('dep.var1', 'dep.var2'))
Here is a way how you could iterate through your dataframe and apply the function to the group you define (here dep.var) and save the different models in a dataframe:
library(tidyverse)
library(broom)
my_data %>%
pivot_longer(
starts_with("dep"),
names_to = "group",
values_to = "dep.var"
) %>%
mutate(group = as.factor(group)) %>%
group_by(group) %>%
group_split() %>%
map_dfr(.f = function(df) {
lm(dep.var ~ x1 + x2, data = df) %>%
tidy() %>% # first output
#glance() %>% # second output
add_column(group = unique(df$group), .before=1)
})
dataframe output:
# A tibble: 9 x 6
group term estimate std.error statistic p.value
<fct> <chr> <dbl> <dbl> <dbl> <dbl>
1 dep.var1 (Intercept) -5.29 11.6 -0.456 0.662
2 dep.var1 x1 2.11 0.268 7.87 0.000101
3 dep.var1 x2 0.538 1.23 0.437 0.675
4 dep.var2 (Intercept) 151. 57.9 2.61 0.0347
5 dep.var2 x1 -10.6 1.34 -7.87 0.000101
6 dep.var2 x2 -2.69 6.15 -0.437 0.675
7 dep.var3 (Intercept) -9.29 11.6 -0.802 0.449
8 dep.var3 x1 2.11 0.268 7.87 0.000101
9 dep.var3 x2 0.538 1.23 0.437 0.675
list output:
[[1]]
# A tibble: 3 x 6
group term estimate std.error statistic p.value
<fct> <chr> <dbl> <dbl> <dbl> <dbl>
1 dep.var1 (Intercept) -5.29 11.6 -0.456 0.662
2 dep.var1 x1 2.11 0.268 7.87 0.000101
3 dep.var1 x2 0.538 1.23 0.437 0.675
[[2]]
# A tibble: 3 x 6
group term estimate std.error statistic p.value
<fct> <chr> <dbl> <dbl> <dbl> <dbl>
1 dep.var2 (Intercept) 151. 57.9 2.61 0.0347
2 dep.var2 x1 -10.6 1.34 -7.87 0.000101
3 dep.var2 x2 -2.69 6.15 -0.437 0.675
[[3]]
# A tibble: 3 x 6
group term estimate std.error statistic p.value
<fct> <chr> <dbl> <dbl> <dbl> <dbl>
1 dep.var3 (Intercept) -9.29 11.6 -0.802 0.449
2 dep.var3 x1 2.11 0.268 7.87 0.000101
3 dep.var3 x2 0.538 1.23 0.437 0.675
glance output:
group r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 dep.var1 0.927 0.906 2.32 44.3 0.000106 2 -20.8 49.7 50.9 37.8 7 10
2 dep.var2 0.927 0.906 11.6 44.3 0.000106 2 -36.9 81.9 83.1 944. 7 10
3 dep.var3 0.927 0.906 2.32 44.3 0.000106 2 -20.8 49.7 50.9 37.8 7 10

Include in data table regression coeffiecients, std.errors and Pvalues using R

how do I create a data.table in r with coefficient, std.err and Pvlaues with rqpd regression type? It's easy with the coefficients using summary(myregression)[2] but don't know how to get std.err and Pval. Thanks
Try with broom:
library(broom)
library(dplyr)
#Model
mod <- lm(Sepal.Length~.,data=iris)
#Broom
summaryobj <- tidy(mod)
Output:
# A tibble: 6 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 2.17 0.280 7.76 1.43e-12
2 Sepal.Width 0.496 0.0861 5.76 4.87e- 8
3 Petal.Length 0.829 0.0685 12.1 1.07e-23
4 Petal.Width -0.315 0.151 -2.08 3.89e- 2
5 Speciesversicolor -0.724 0.240 -3.01 3.06e- 3
6 Speciesvirginica -1.02 0.334 -3.07 2.58e- 3
Found a solution that is working
summ <- summary(myregression, se = "boot")
summ
str(summ)
PValues <- summ$coefficients[,4]

How to show reference level from lm output?

I'd like to add a reference level to the final output of linear regression output lm().
For example:
levels(iris$Species)
"setosa" "versicolor" "virginica"
summary(lm(Sepal.Length ~ Petal.Width + Species, iris))
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.78044 0.08308 57.543 < 2e-16 ***
Petal.Width 0.91690 0.19386 4.730 5.25e-06 ***
Speciesversicolor -0.06025 0.23041 -0.262 0.794
Speciesvirginica -0.05009 0.35823 -0.140 0.889
I'd like to have it like:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.78044 0.08308 57.543 < 2e-16 ***
Petal.Width 0.91690 0.19386 4.730 5.25e-06 ***
Speciessetosa
Speciesversicolor -0.06025 0.23041 -0.262 0.794
Speciesvirginica -0.05009 0.35823 -0.140 0.889
I've been looking for it a lot for a while but no clues yet. Any help would be highly appreciated.
#EDIT
Data for further expansion:
iris$Petal.Width <- as.factor(ifelse(iris$Petal.Width >1, "Big", "Small"))
levels(iris$Petal.Width)
"Big" "Small"
Here is a basic workflow you can work off of, is uses dplyr and broom to join your levels with your coefficients table. Right now it requires you know which variables are factors. You could change the NA to "" if you prefer. It also organizes the output alphabetically which will not always put the reference group first. Let me know if you have any issues with scaling of this:
library(broom)
library(dplyr)
iris <- datasets::iris
iris$Petal.Width <- factor(ifelse(iris$Petal.Width > 1, "Big", "Small"), levels = c("Small", "Big"))
reg_obj <- lm(Sepal.Length ~ Petal.Width + Species, iris)
factor_levels <- tibble(term = c(paste0("Species", levels(iris$Species)),
paste0("Petal.Width", levels(iris$Petal.Width))))
full_join(tidy(reg_obj), factor_levels, by = "term") %>%
arrange(term)
# A tibble: 6 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 5.01 0.0709 70.6 1.03e-114
2 Petal.WidthBig 0.607 0.204 2.97 3.51e- 3
3 Petal.WidthSmall NA NA NA NA
4 Speciessetosa NA NA NA NA
5 Speciesversicolor 0.408 0.202 2.02 4.55e- 2
6 Speciesvirginica 0.975 0.228 4.28 3.33e- 5
This produces the desired output:
res <- capture.output(summary(lm(Sepal.Length ~ Petal.Width + Species, data = iris)))
res[14:22] <- res[13:21]
res[13] <- "Speciessetosa"
cat(res, sep = "\n")

How to iterate over columns with vectorization together with group_by function from dplyr

As explained by Fitting several regression models with dplyr, we can use the tidy function from broom package to run the regression across groups. For instance, a demo code for iris dataset is listed below, but what if, in a simultaneous manner, we intend to loop over the multiple columns and run the regression with different dependent variables (Sepal.Length,Sepal.Width,Petal.Length) together with this group_by manipulation, how can I integrate the (s)apply function into such a situation and get the results for these regression models(3*3=9)?
library(dplyr);library(broom)
res1=iris%>%
group_by(Species)%>%
do(res=lm(Sepal.Length~Petal.Width,data=.))
tidy(res1, res)%>%
filter(term!="(Intercept)")
You can do this using lme4::lmList and broom.mixed::tidy. You may be able to adapt it to a pipe, but this should get you started. Here, lmList essentially performs the same function as group_by in the dplyr pipe, but it is easier for me to conceptualize how to pipe through several DVs using lapply. Good luck!!
library(lme4)
library(broom.mixed)
# Selecting DVs
dvs <- names(iris)[1:3]
# Making formula objects
formula_text <- paste0(dvs, "~ Petal.Width | Species")
formulas <- lapply(formula_text, formula)
# Running grouped analyses and looping through DVs
results <- lapply(formulas, function(x) {
res <- broom.mixed::tidy(lmList(x, iris))
res[res$terms != "(Intercept)",]
})
# Renaming and viewing results
names(results) <- formula_text
And, viewing the results:
results
$`Sepal.Length~ Petal.Width | Species`
# A tibble: 3 x 6
group terms estimate p.value std.error statistic
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 setosa Petal.Width 0.930 0.154 0.649 1.43
2 versicolor Petal.Width 1.43 0.0000629 0.346 4.12
3 virginica Petal.Width 0.651 0.00993 0.249 2.61
$`Sepal.Width~ Petal.Width | Species`
# A tibble: 3 x 6
group terms estimate p.value std.error statistic
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 setosa Petal.Width 0.837 0.0415 0.407 2.06
2 versicolor Petal.Width 1.05 0.00000306 0.217 4.86
3 virginica Petal.Width 0.631 0.0000855 0.156 4.04
$`Petal.Length~ Petal.Width | Species`
# A tibble: 3 x 6
group terms estimate p.value std.error statistic
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 setosa Petal.Width 0.546 2.67e- 1 0.490 1.12
2 versicolor Petal.Width 1.87 3.84e-11 0.261 7.16
3 virginica Petal.Width 0.647 7.55e- 4 0.188 3.44

Resources