Regression by Groups [duplicate] - r

This question already has answers here:
Linear Regression and group by in R
(10 answers)
Closed 4 years ago.
I have a table:
CityData ->
City Price Bathrooms Bedrooms Porch
Milwaukee 2300 2 3 yes
Chicago 3400 3 2 yes
Springfield 2300 1 1 no
Chicago 2390 2 1 yes
I would like to run a separate regression for each city (there are multiple rows per city) and get coefficients for each one. I want to regress Price on the other predictors (Bathrooms, Bedrooms, Porch).
I tried the dplyr library:
library(dplyr)
fitted_models = CityData %>%
group_by(CityData$City) %>%
do(model = lm(CityData$Price ~ CityData$Bathrooms +
CityData$Porch + CityData$Bedrooms, data = CityData))
But the output is just
14 lm list
14 lm list
14 lm list
Any suggestions?

You might try something like this. Here I'll use the mtcars data as an example.
df <- mtcars
models <- df %>% group_by(cyl) %>% summarise(mod = list(lm(mpg ~ wt)))
This will give you a new variable mod that holds all the info for your model. You can call the coefficients like:
models$mod[[1]]$coefficients
(Intercept) wt
39.571196 -5.647025
You can get more complex with it too.
models <- df %>% group_by(cyl) %>% summarise(mod = list(lm(mpg ~ wt + hp)))
models$mod[[1]]$coefficients
(Intercept) wt hp
45.83607319 -5.11506233 -0.09052672
Of course, models will still hold the info for the group:
models$cyl
[1] 4 6 8
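Applied to the CityData example from the question, the same pattern would look roughly like this (a sketch, assuming each city has enough rows and enough variation in the predictors for lm to fit; the key is to drop the CityData$ prefixes so the grouping is respected):
library(dplyr)
# One model per city; using bare column names (not CityData$...) lets
# group_by() do its job instead of lm() seeing the full data set each time
city_models <- CityData %>%
  group_by(City) %>%
  summarise(mod = list(lm(Price ~ Bathrooms + Bedrooms + Porch)))
# Coefficients for the first city
city_models$mod[[1]]$coefficients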

Related

recipes package cannot create interaction term in step_interact

I'm using a medical insurance data set to hone my modeling skills. It looks like this:
> insur_dt
age sex bmi children smoker region charges
1: 19 female 27.900 0 yes southwest 16884.924
2: 18 male 33.770 1 no southeast 1725.552
3: 28 male 33.000 3 no southeast 4449.462
4: 33 male 22.705 0 no northwest 21984.471
5: 32 male 28.880 0 no northwest 3866.855
---
1334: 50 male 30.970 3 no northwest 10600.548
1335: 18 female 31.920 0 no northeast 2205.981
1336: 18 female 36.850 0 no southeast 1629.833
1337: 21 female 25.800 0 no southwest 2007.945
1338: 61 female 29.070 0 yes northwest 29141.360
I'm using recipes as part of the tidymodels meta-package to prepare my data for use in a model, and I have determined that bmi, age, and smoker form an interaction term.
insur_split <- initial_split(insur_dt)
insur_train <- training(insur_split)
insur_test <- testing(insur_split)
# we are going to do data processing and feature engineering with recipes
# below, we are going to predict charges using everything else(".")
insur_rec <- recipe(charges ~ age + bmi + smoker, data = insur_train) %>%
step_dummy(all_nominal()) %>%
step_zv(all_numeric()) %>%
step_normalize(all_numeric()) %>%
step_interact(~ bmi:smoker:age) %>%
prep()
Per the tidymodels guide/documentation, I have to specify the interaction as a step in the recipe as step_interact. However, I am getting an error when I attempt to do so:
> insur_rec <- recipe(charges ~ age + bmi + smoker, data = insur_train) %>%
+ step_dummy(all_nominal()) %>%
+ step_zv(all_numeric()) %>%
+ step_normalize(all_numeric()) %>%
+ step_interact(~ bmi:smoker:age) %>%
+ prep()
Interaction specification failed for: ~bmi:smoker:age. No interactions will be created.partial match of 'object' to 'objects'
I am new to modeling and am not quite sure why I am getting this error. I am simply trying to state that charges is explained by all other predictors, and that smoker (a yes/no factor), age (numeric), and bmi (double) all interact with each other to inform the outcome. What am I doing wrong?
From the documentation:
step_interact can create interactions between variables. It is primarily intended for numeric data; categorical variables should probably be converted to dummy variables using step_dummy() prior to being used for interactions.
step_dummy(all_nominal()) turned the variable smoker into smoker_yes. Below, you'll see that I just changed the name of smoker in the interaction term to smoker_yes.
insur_rec <- recipe(charges ~ bmi + age + smoker, data = insur_train) %>%
step_dummy(all_nominal()) %>%
step_normalize(all_numeric(), -all_outcomes()) %>%
step_interact(terms = ~ bmi:age:smoker_yes) %>%
prep(verbose = TRUE, log_changes = TRUE)
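To double-check that the interaction column was actually created, you can inspect the processed training data (a quick sanity check; this assumes a recipes version where bake(..., new_data = NULL) returns the prepped training set, otherwise juice(insur_rec) does the same):
baked <- bake(insur_rec, new_data = NULL)
# The interaction should show up as a new column, typically named bmi_x_age_x_smoker_yes
names(baked)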

How do I do a simple regression multiple times?

I've seen a few variations of this question, but they don't seem to specifically answer what I'm trying to accomplish. I have a data frame (df):
month ter dist emp_count var1 var2
1 1 10 21 3000 5120
2 1 10 20 3100 5340
3 1 10 20 3100 5543
4 1 10 21 3250 5625
5 1 10 24 3200 5254
6 1 10 25 3300 5634
7 1 10 26 3600 5435
8 1 10 26 3900 7546
. . . . . .
. . . . . .
. . . . . .
ter holds the values 1, 2, 3, or 4, and dist can be any number from 1 through 50. I want to run the same simple regression multiple times, grouped by ter or dist.
I have this:
model = lm(var1 ~ emp_count, data = df)
summary(model)
But I'd rather not write out a regression 50 times if I want to compare based on dist.
Split by your ter or dist, then lapply your fit and summary:
lapply(split(df, df$dist), function(x) summary(lm(var1 ~ emp_count, data = x)))
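If you only need the coefficients rather than full summaries, a small variation of the same idea collects them into a matrix (a sketch using the same df and columns):
# Fit one model per dist, then bind the coefficient vectors into a matrix
fits <- lapply(split(df, df$dist), function(x) lm(var1 ~ emp_count, data = x))
coefs <- t(sapply(fits, coef))
head(coefs)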
Another solution uses the dplyr and broom packages. Here is the code for your example. First, extract the coefficients and p-values for the intercept and slope of every linear model (lm), grouped by the variable dist. tidy plays a role similar to summary(lm).
library(dplyr)
library(broom)
lmodelsCoef <- df %>%
  group_by(dist) %>%
  do(tidy(lm(var1 ~ emp_count, data = .)))
lmodelsCoef <- lmodelsCoef %>%
  group_by(dist) %>%
  summarize(intercept = estimate[1],
            p.value_intercept = p.value[1],
            slope = estimate[2],
            p.value_slope = p.value[2])
Next you need to extract the R-squared value. tidy() does not return it, since it reports per-coefficient statistics; glance() reports model-level statistics such as R-squared, so use glance for that.
lmodelsCoef2 <- df %>%
  group_by(dist) %>%
  do(glance(lm(var1 ~ emp_count, data = .)))
lmodelsCoef2 <- lmodelsCoef2 %>%
  group_by(dist) %>%
  summarize(r.squared = r.squared)
#Get the final df
df_lm<-data.frame(lmodelsCoef,
r.squared = lmodelsCoef2$r.squared)
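As a side note, if you are on dplyr 0.8 or later, group_modify() can stand in for the do() calls above (a sketch under that assumption; everything else stays the same):
lmodelsCoef <- df %>%
  group_by(dist) %>%
  group_modify(~ tidy(lm(var1 ~ emp_count, data = .x)))
lmodelsCoef2 <- df %>%
  group_by(dist) %>%
  group_modify(~ glance(lm(var1 ~ emp_count, data = .x)))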
If you only need the coefficients and p-values, then you can use lmList from nlme or lme4:
library(lme4)
df = data.frame(dist = rep(1:50, each = 50),
                month = sample(1:12, 2500, replace = TRUE),
                emp_count = rpois(2500, 20),
                var1 = rpois(2500, 40), var2 = rpois(2500, 50))
lmList(var1 ~ emp_count | dist, data = df)
Call: lmList(formula = var1 ~ emp_count | dist, data = df)
Coefficients:
(Intercept) emp_count
1 9.9885028 1.257080e-02
2 96.5774029 -2.238488e-02
3 11.5427710 1.143071e-02
4 37.5422288 8.699393e-04
5 -44.4468575 3.367506e-02
6 50.4651290 -4.084562e-03
To get p-values, standard errors, etc.:
summary(lmList(var1 ~ emp_count | dist,data=df))
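If you would rather have the per-group coefficients as a data frame than as printed output, coef() on the lmList object returns one row per group (a short follow-up sketch):
fits <- lmList(var1 ~ emp_count | dist, data = df)
coef(fits)  # data frame with one row per dist and columns (Intercept), emp_count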

loess regression on each group with dplyr::group_by()

Alright, I'm waving my white flag.
I'm trying to compute a loess regression on my dataset.
I want loess to compute a different set of points that plots as a smooth line for each group.
The problem is that the loess calculation is escaping the dplyr::group_by function, so the loess regression is calculated on the whole dataset.
Internet searching leads me to believe this is because dplyr::group_by wasn't meant to work this way.
I just can't figure out how to make this work on a per-group basis.
Here are some examples of my failed attempts.
test2 <- test %>%
group_by(CpG) %>%
dplyr::arrange(AVGMOrder) %>%
do(broom::tidy(predict(loess(Meth ~ AVGMOrder, span = .85, data=.))))
> test2
# A tibble: 136 x 2
# Groups: CpG [4]
CpG x
<chr> <dbl>
1 cg01003813 0.781
2 cg01003813 0.793
3 cg01003813 0.805
4 cg01003813 0.816
5 cg01003813 0.829
6 cg01003813 0.841
7 cg01003813 0.854
8 cg01003813 0.866
9 cg01003813 0.878
10 cg01003813 0.893
This one works, but I can't figure out how to apply the result to a column in my original dataframe. The result I want is column x. If I apply x as a column in a separate line, I run into issues because I called dplyr::arrange earlier.
test2 <- test %>%
group_by(CpG) %>%
dplyr::arrange(AVGMOrder) %>%
dplyr::do({
predict(loess(Meth ~ AVGMOrder, span = .85, data=.))
})
This one simply fails with the following error.
"Error: Results 1, 2, 3, 4 must be data frames, not numeric"
Also, it still isn't applied as a new column with dplyr::mutate.
fems <- fems %>%
group_by(CpG) %>%
dplyr::arrange(AVGMOrder) %>%
dplyr::mutate(Loess = predict(loess(Meth ~ AVGMOrder, span = .5, data=.)))
This was my first attempt and most closely resembles what I want to do. The problem is that it performs the loess prediction on the entire dataframe and not on each CpG group.
I am really stuck here. I read online that the purrr package might help, but I'm having trouble figuring it out.
data looks like this:
> head(test)
X geneID CpG CellLine Meth AVGMOrder neworder Group SmoothMeth
1 40 XG cg25296477 iPS__HDF51IPS14_passage27_Female____165.592.1.2 0.81107210 1 1 5 0.7808767
2 94 XG cg01003813 iPS__HDF51IPS14_passage27_Female____165.592.1.2 0.97052120 1 1 5 0.7927130
3 148 XG cg13176022 iPS__HDF51IPS14_passage27_Female____165.592.1.2 0.06900448 1 1 5 0.8045080
4 202 XG cg26484667 iPS__HDF51IPS14_passage27_Female____165.592.1.2 0.84077890 1 1 5 0.8163997
5 27 XG cg25296477 iPS__HDF51IPS6_passage33_Female____157.647.1.2 0.81623880 2 2 3 0.8285259
6 81 XG cg01003813 iPS__HDF51IPS6_passage33_Female____157.647.1.2 0.95569240 2 2 3 0.8409501
unique(test$CpG)
[1] "cg25296477" "cg01003813" "cg13176022" "cg26484667"
So, to be clear, I want to do a loess regression on each unique CpG in my dataframe and put the resulting fitted y-axis values in a column alongside the original y-axis values (Meth).
My actual dataset has a few thousand of those CpG's, not just the four.
https://docs.google.com/spreadsheets/d/1-Wluc9NDFSnOeTwgBw4n0pdPuSlMSTfUVM0GJTiEn_Y/edit?usp=sharing
This is a neat Tidyverse way to make it work:
library(dplyr)
library(tidyr)
library(purrr)
library(ggplot2)
models <- fems %>%
tidyr::nest(-CpG) %>%
dplyr::mutate(
# Perform loess calculation on each CpG group
m = purrr::map(data, loess,
formula = Meth ~ AVGMOrder, span = .5),
# Retrieve the fitted values from each model
fitted = purrr::map(m, `[[`, "fitted")
)
# Apply fitted y's as a new column
results <- models %>%
dplyr::select(-m) %>%
tidyr::unnest()
# Plot with loess line for each group
ggplot(results, aes(x = AVGMOrder, y = Meth, group = CpG, colour = CpG)) +
geom_point() +
geom_line(aes(y = fitted))
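Note that tidyr::nest(-CpG) and the bare unnest() call use the older tidyr interface; with tidyr 1.0 or later the same pipeline is spelled slightly differently (a sketch under that assumption):
models <- fems %>%
  tidyr::nest(data = -CpG) %>%
  dplyr::mutate(
    # Perform loess calculation on each CpG group
    m = purrr::map(data, ~ loess(Meth ~ AVGMOrder, span = .5, data = .x)),
    # Retrieve the fitted values from each model
    fitted = purrr::map(m, "fitted")
  )
results <- models %>%
  dplyr::select(-m) %>%
  tidyr::unnest(cols = c(data, fitted))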
You may have already figured this out -- but if not, here's some help.
Basically, you need to feed the predict function a data.frame (a vector may work too but I didn't try it) of the values you want to predict at.
So for your case:
fems <- fems %>%
group_by(CpG) %>%
arrange(CpG, AVGMOrder) %>%
mutate(Loess = predict(loess(Meth ~ AVGMOrder, span = .5, data=.),
data.frame(AVGMOrder = seq(min(AVGMOrder), max(AVGMOrder), 1))))
Note, loess requires a minimum number of observations to run (~4? I can't remember precisely). Also, this will take a while to run so test with a slice of your data to make sure it's working properly.
Unfortunately, the approaches described above did not work in my case. So I wrapped the loess prediction in a regular function, which worked very well. In the example below, the data is contained in the df data frame; we group by df$profile and want to fit the loess prediction to the df$daily_sum values.
# Define important variables
span_60 <- 60/365 # 60 days of a year
span_365 <- 365/365 # a whole year
# Group and order the data set
df <- as.data.frame(
  df %>%
    group_by(profile) %>%
    arrange(profile, day)
)
# Define the Loess function. x is the data frame that has to be passed
predict_loess <- function(x) {
  # Declare that the loess columns exist, but are blank
  x$loess_60 <- NA
  x$loess_365 <- NA
  # Identify all unique profile IDs
  all_ids <- unique(x$profile)
  # Iterate through the unique profile IDs, determine the length of each vector
  # (which should correspond to 365 days) and isolate the rows that belong to
  # the current profile ID.
  for (i in all_ids) {
    len_entries <- length(which(x$profile == i))
    queried_rows <- x[which(x$profile == i), ]
    # Run the loess fits and write the results to the matching columns
    fit_60 <- predict(loess(daily_sum ~ seq(1, len_entries), data = queried_rows, span = span_60))
    fit_365 <- predict(loess(daily_sum ~ seq(1, len_entries), data = queried_rows, span = span_365))
    x[which(x$profile == i), "loess_60"] <- fit_60
    x[which(x$profile == i), "loess_365"] <- fit_365
  }
  # Return the modified data frame
  return(x)
}
# Run the Loess prediction and put the results into two columns - one for a short and one for a long time span
df <- predict_loess(df)

How to add a column using the mapping vector after purrr::map_df

I'm using the mtcars dataset as an example to illustrate my question. I ran a linear regression on each cylinder type and put all the model results together using map_df (code and output below). What I want to do is add another column named 'cylinder' (4,4,6,6,8,8). How can I do that in map_df? When I add the argument .id='cylinder', I only get a column with 1,1,2,2,3,3. Thanks a lot in advance.
library(purrr)
cyls <- c(4,6,8)
map_df(cyls, ~tidy(lm(hp~wt,data=mtcars %>% filter(cyl == .x))))
Using set_names should do it
cyls %>%
set_names() %>%
map_df(., ~tidy(lm(hp~wt,data=mtcars %>% filter(cyl == .x))), .id = "cyls")
  cyls        term    estimate std.error   statistic    p.value
1    4 (Intercept)   69.204726  28.41355  2.43562436 0.03763378
2    4          wt    5.876308  12.09420  0.48587823 0.63866393
3    6 (Intercept)  187.273314  90.85245  2.06129062 0.09427827
4    6          wt  -20.848451  28.98418 -0.71930440 0.50415924
5    8 (Intercept)  204.484626  78.77132  2.59592744 0.02340090
6    8          wt    1.182647  19.37501  0.06103983 0.95233233
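An alternative, if you prefer the group value as a numeric column rather than the character .id column, is to attach it inside the mapped function (a sketch, assuming dplyr and broom are loaded):
map_df(cyls, ~ tidy(lm(hp ~ wt, data = mtcars %>% filter(cyl == .x))) %>%
  mutate(cylinder = .x))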

Regression in R with Groups [duplicate]

This question already has answers here:
Linear Regression and group by in R
(10 answers)
Closed 6 years ago.
I have imported a CSV with 3 columns: two columns for Y and X, and a third column that identifies the category for X (I have 20 groups/categories). I am able to run a regression at the overall level, but I want to run the regression for the 20 categories separately and store the coefficients.
I tried the following :
list2env(split(sample, sample$CATEGORY_DESC), envir = .GlobalEnv)
Now I have 20 data frames; how do I run a regression on each of them and store the coefficients somewhere?
Since no data was provided, I am generating some sample data to show how you can run multiple regressions and store output using dplyr and broom packages.
In the following, there are 20 groups and different x/y values per group. 20 regressions are run and output of these regressions is provided as a data frame:
library(dplyr)
library(broom)
df <- data.frame(group = rep(1:20, 10),
x = rep(1:20, 10) + rnorm(200),
y = rep(1:20, 10) + rnorm(200))
df %>% group_by(group) %>% do(tidy(lm(x ~ y, data = .)))
Sample output:
Source: local data frame [40 x 6]
Groups: group [20]
group term estimate std.error statistic p.value
<int> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 (Intercept) 0.42679228 1.0110422 0.4221310 0.684045203
2 1 y 0.45625124 0.7913256 0.5765657 0.580089051
3 2 (Intercept) 1.99367392 0.4731639 4.2134955 0.002941805
4 2 y 0.05101438 0.1909607 0.2671460 0.796114398
5 3 (Intercept) 3.14391308 0.8417638 3.7349114 0.005747126
6 3 y 0.08418715 0.2453441 0.3431391 0.740336702
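If the goal is to store one row of coefficients per group, the tidy output can be reshaped to wide format (a sketch, assuming tidyr 1.0+ for pivot_wider):
library(tidyr)
coefs_wide <- df %>%
  group_by(group) %>%
  do(tidy(lm(x ~ y, data = .))) %>%
  select(group, term, estimate) %>%
  pivot_wider(names_from = term, values_from = estimate)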
Quick solution with lmList (package nlme):
library(nlme)
lmList(x ~ y | group, data=df)
Call:
Model: x ~ y | group
Data: df
Coefficients:
(Intercept) y
1 0.4786373 0.04978624
2 3.5125369 -0.94751894
3 2.7429958 -0.01208329
4 -5.2231576 2.24589181
5 5.6370824 -0.24223131
6 7.1785581 -0.08077726
7 8.2060808 -0.18283134
8 8.9072851 -0.13090764
9 10.1974577 -0.18514527
10 6.0687105 0.37396911
11 9.0682622 0.23469187
12 15.1081915 -0.29234452
13 17.3147636 -0.30306692
14 13.1352411 0.05873189
15 6.4006623 0.57619151
16 25.4454182 -0.59535396
17 22.0231916 -0.30073768
18 27.7317267 -0.54651597
19 10.9689733 0.45280604
20 23.3495704 -0.14488522
Degrees of freedom: 200 total; 160 residual
Residual standard error: 0.9536226
Borrowing the data frame df from @Gopala's answer, consider also a base R solution with lapply():
regressionList <- lapply(unique(df$group),
                         function(g) lm(x ~ y, data = df[df$group == g, ]))
And only the coefficients:
coeffList <- lapply(unique(df$group),
                    function(g) lm(x ~ y, data = df[df$group == g, ])$coefficients)
Even list of summaries:
summaryList <- lapply(unique(df$group),
                      function(g) summary(lm(x ~ y, data = df[df$group == g, ])))
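To collect those per-group coefficients into a single object (a short follow-up, continuing from coeffList above):
# One row per group, columns (Intercept) and y
coefMatrix <- do.call(rbind, coeffList)
rownames(coefMatrix) <- unique(df$group)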
