Fitting a Different Linear Model to Each Player [duplicate]

I would like to fit a model for each hour (the factor variable) using dplyr, but I'm getting an error and I'm not quite sure what's wrong.
df.h <- data.frame(
  hour  = factor(rep(1:24, each = 21)),
  price = runif(504, min = -10, max = 125),
  wind  = runif(504, min = 0, max = 2500),
  temp  = runif(504, min = -10, max = 25)
)
df.h <- tbl_df(df.h)
df.h <- group_by(df.h, hour)
group_size(df.h) # checks out: 21 obs. for each level of hour
# different attempts, neither of which works:
reg.models <- do(df.h, formula = price ~ wind + temp)
reg.models <- do(df.h, .f = lm(price ~ wind + temp, data = df.h))
I've tried various variations, but I can't get it to work.

The easiest way to do this, circa May 2015, is to use broom. broom contains three functions that deal with the complex objects returned by statistical operations on grouped data: tidy (which extracts coefficient-level results), glance (which extracts model-level summary statistics), and augment (which extracts observation-level results).
Here is a demonstration of its use to extract the various results of linear regression by groups into tidy data_frames.
tidy:
library(dplyr)
library(broom)
df.h = data.frame(
  hour = factor(rep(1:24, each = 21)),
  price = runif(504, min = -10, max = 125),
  wind = runif(504, min = 0, max = 2500),
  temp = runif(504, min = -10, max = 25)
)
dfHour = df.h %>% group_by(hour) %>%
  do(fitHour = lm(price ~ wind + temp, data = .))
# get the coefficients by group in a tidy data_frame
dfHourCoef = tidy(dfHour, fitHour)
dfHourCoef
which gives,
Source: local data frame [72 x 6]
Groups: hour
hour term estimate std.error statistic p.value
1 1 (Intercept) 53.336069324 21.33190104 2.5002961 0.022294293
2 1 wind -0.008475175 0.01338668 -0.6331053 0.534626575
3 1 temp 1.180019541 0.79178607 1.4903262 0.153453756
4 2 (Intercept) 77.737788772 23.52048754 3.3051096 0.003936651
5 2 wind -0.008437212 0.01432521 -0.5889765 0.563196358
6 2 temp -0.731265113 1.00109489 -0.7304653 0.474506855
7 3 (Intercept) 38.292039924 17.55361626 2.1814331 0.042655670
8 3 wind 0.005422492 0.01407478 0.3852630 0.704557388
9 3 temp 0.426765270 0.83672863 0.5100402 0.616220435
10 4 (Intercept) 30.603119492 21.05059583 1.4537888 0.163219027
.. ... ... ... ... ... ...
augment:
# get the predictions by group in a tidy data_frame
dfHourPred = augment(dfHour, fitHour)
dfHourPred
which gives,
Source: local data frame [504 x 11]
Groups: hour
hour price wind temp .fitted .se.fit .resid .hat .sigma .cooksd .std.resid
1 1 83.8414055 67.3780 -6.199231 45.44982 22.42649 38.391590 0.27955950 42.24400 0.1470891067 1.0663820
2 1 0.3061628 2073.7540 15.134085 53.61916 14.10041 -53.312993 0.11051343 41.43590 0.0735584714 -1.3327207
3 1 80.3790032 520.5949 24.711938 78.08451 20.03558 2.294497 0.22312869 43.64059 0.0003606305 0.0613746
4 1 121.9023855 1618.0864 12.382588 54.23420 10.31293 67.668187 0.05911743 40.23212 0.0566557575 1.6447224
5 1 -0.4039594 1542.8150 -5.544927 33.71732 14.53349 -34.121278 0.11740628 42.74697 0.0325125137 -0.8562896
6 1 29.8269832 396.6951 6.134694 57.21307 16.04995 -27.386085 0.14318542 43.05124 0.0271028701 -0.6975290
7 1 -7.1865483 2009.9552 -5.657871 29.62495 16.93769 -36.811497 0.15946292 42.54487 0.0566686969 -0.9466312
8 1 -7.8548693 2447.7092 22.043029 58.60251 19.94686 -66.457379 0.22115706 39.63999 0.2983443034 -1.7753911
9 1 94.8736726 1525.3144 24.484066 69.30044 15.93352 25.573234 0.14111563 43.12898 0.0231796755 0.6505701
10 1 54.4643001 2473.2234 -7.656520 23.34022 21.83043 31.124076 0.26489650 42.74790 0.0879837510 0.8558507
.. ... ... ... ... ... ... ... ... ... ... ...
glance:
# get the summary statistics by group in a tidy data_frame
dfHourSumm = glance(dfHour, fitHour)
dfHourSumm
which gives,
Source: local data frame [24 x 12]
Groups: hour
hour r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual
1 1 0.12364561 0.02627290 42.41546 1.2698179 0.30487225 3 -106.8769 221.7538 225.9319 32383.29 18
2 2 0.03506944 -0.07214506 36.79189 0.3270961 0.72521125 3 -103.8900 215.7799 219.9580 24365.58 18
3 3 0.02805424 -0.07993974 39.33621 0.2597760 0.77406651 3 -105.2942 218.5884 222.7665 27852.07 18
4 4 0.17640603 0.08489559 41.37115 1.9277147 0.17434859 3 -106.3534 220.7068 224.8849 30808.30 18
5 5 0.12575453 0.02861615 42.27865 1.2945915 0.29833246 3 -106.8091 221.6181 225.7962 32174.72 18
6 6 0.08114417 -0.02095092 35.80062 0.7947901 0.46690268 3 -103.3164 214.6328 218.8109 23070.31 18
7 7 0.21339168 0.12599076 32.77309 2.4415266 0.11529934 3 -101.4609 210.9218 215.0999 19333.36 18
8 8 0.21655629 0.12950699 40.92788 2.4877430 0.11119114 3 -106.1272 220.2543 224.4324 30151.65 18
9 9 0.23388711 0.14876346 35.48431 2.7476160 0.09091487 3 -103.1300 214.2601 218.4381 22664.45 18
10 10 0.18326177 0.09251307 40.77241 2.0194425 0.16171339 3 -106.0472 220.0945 224.2726 29923.01 18
.. ... ... ... ... ... ... .. ... ... ... ... ...

As of mid 2020 (and updated to fit dplyr 1.0+ as of 2022-04), tchakravarty's answer will fail. To work with the new way broom and dplyr interact, the following combination of broom::tidy, broom::augment and broom::glance can be used. We just have to use them in conjunction with nest_by() and summarize() (instead of wrapping them inside do() and later unnest()-ing the tibble).
library(dplyr)
library(broom)
library(tidyr)
set.seed(42)
df.h = data.frame(
  hour = factor(rep(1:24, each = 21)),
  price = runif(504, min = -10, max = 125),
  wind = runif(504, min = 0, max = 2500),
  temp = runif(504, min = -10, max = 25)
)
df.h %>%
  nest_by(hour) %>%
  mutate(mod = list(lm(price ~ wind + temp, data = data))) %>%
  summarize(tidy(mod))
# # A tibble: 72 × 6
# # Groups: hour [24]
# hour term estimate std.error statistic p.value
# <fct> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 1 (Intercept) 87.4 15.8 5.55 0.0000289
# 2 1 wind -0.0129 0.0120 -1.08 0.296
# 3 1 temp 0.588 0.849 0.693 0.497
# 4 2 (Intercept) 92.3 21.6 4.27 0.000466
# 5 2 wind -0.0227 0.0134 -1.69 0.107
# 6 2 temp -0.216 0.841 -0.257 0.800
# 7 3 (Intercept) 61.1 18.6 3.29 0.00409
# 8 3 wind 0.00471 0.0128 0.367 0.718
# 9 3 temp 0.425 0.964 0.442 0.664
# 10 4 (Intercept) 31.6 15.3 2.07 0.0529
df.h %>%
  nest_by(hour) %>%
  mutate(mod = list(lm(price ~ wind + temp, data = data))) %>%
  summarize(augment(mod))
# # A tibble: 504 × 10
# # Groups: hour [24]
# hour price wind temp .fitted .resid .hat .sigma .cooksd .std.resid
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 113. 288. -1.75 82.7 30.8 0.123 37.8 0.0359 0.877
# 2 1 117. 2234. 18.4 69.5 47.0 0.201 36.4 0.165 1.40
# 3 1 28.6 1438. 4.75 71.7 -43.1 0.0539 37.1 0.0265 -1.18
# 4 1 102. 366. 9.77 88.5 13.7 0.151 38.4 0.00926 0.395
# 5 1 76.6 2257. -4.69 55.6 21.0 0.245 38.2 0.0448 0.644
# 6 1 60.1 633. -3.18 77.4 -17.3 0.0876 38.4 0.00749 -0.484
# 7 1 89.4 376. -4.16 80.1 9.31 0.119 38.5 0.00314 0.264
# 8 1 8.18 1921. 19.2 74.0 -65.9 0.173 34.4 0.261 -1.93
# 9 1 78.7 575. -6.11 76.4 2.26 0.111 38.6 0.000170 0.0640
# 10 1 85.2 763. -0.618 77.2 7.94 0.0679 38.6 0.00117 0.219
# # … with 494 more rows
df.h %>%
  nest_by(hour) %>%
  mutate(mod = list(lm(price ~ wind + temp, data = data))) %>%
  summarize(glance(mod))
# # A tibble: 24 × 13
# # Groups: hour [24]
# hour r.squared adj.r.squared sigma statistic p.value df logLik AIC
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 0.0679 -0.0357 37.5 0.655 0.531 2 -104. 217.
# 2 2 0.139 0.0431 42.7 1.45 0.261 2 -107. 222.
# 3 3 0.0142 -0.0953 43.1 0.130 0.879 2 -107. 222.
# 4 4 0.0737 -0.0293 36.7 0.716 0.502 2 -104. 216.
# 5 5 0.213 0.126 37.8 2.44 0.115 2 -104. 217.
# 6 6 0.0813 -0.0208 33.5 0.796 0.466 2 -102. 212.
# 7 7 0.0607 -0.0437 40.7 0.582 0.569 2 -106. 220.
# 8 8 0.153 0.0592 36.3 1.63 0.224 2 -104. 215.
# 9 9 0.166 0.0736 36.5 1.79 0.195 2 -104. 216.
# 10 10 0.110 0.0108 40.0 1.11 0.351 2 -106. 219.
# # … with 14 more rows, and 4 more variables: BIC <dbl>, deviance <dbl>,
# # df.residual <int>, nobs <int>
Credit to Bob Muenchen's blog for the inspiration.

In dplyr 0.4, you can do:
df.h %>% do(model = lm(price ~ wind + temp, data = .))
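The named model argument makes do() return one row per hour with a list-column of fitted models. A small sketch (assuming df.h is grouped by hour, as in the question) of inspecting them afterwards:
fits <- df.h %>% do(model = lm(price ~ wind + temp, data = .))
fits$model[[1]]          # the lm fit for the first hour
lapply(fits$model, coef) # coefficients for every hour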

From the documentation for do():
.f: a function to apply to each piece. The first unnamed argument supplied to .f will be a data frame.
So:
reg.models <- do(df.h,
                 .f = function(data) {
                   lm(price ~ wind + temp, data = data)
                 })
It is probably useful to also save which hour each model was fitted for:
reg.models <- do(df.h,
                 .f = function(data) {
                   m <- lm(price ~ wind + temp, data = data)
                   m$hour <- unique(data$hour)
                   m
                 })
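In the dplyr version this answer targets, do() used this way returns a list with one fitted model per group, so the stored hour can label the fits afterwards. A hedged sketch, assuming reg.models comes back as such a list:
# label each fit with its stored hour, then collect the coefficients
names(reg.models) <- sapply(reg.models, function(m) as.character(m$hour))
lapply(reg.models, coef)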

I believe there's a more compact answer than loki's answer, which also abandons the since-superseded do():
library(dplyr)
library(broom)
library(tidyr)
h.lm <- df.h %>%
  nest_by(hour) %>%
  mutate(fitHour = list(lm(price ~ wind + temp, data = data))) %>%
  summarise(tidy_out = list(tidy(fitHour)),
            glance_out = list(glance(fitHour)),
            augment_out = list(augment(fitHour))) %>%
  ungroup()
h.lm
# # A tibble: 24 x 4
# hour tidy_out glance_out augment_out
# <fct> <list> <list> <list>
# 1 1 <tibble [3 × 5]> <tibble [1 × 12]> <tibble [21 × 9]>
# 2 2 <tibble [3 × 5]> <tibble [1 × 12]> <tibble [21 × 9]>
# 3 3 <tibble [3 × 5]> <tibble [1 × 12]> <tibble [21 × 9]>
# 4 4 <tibble [3 × 5]> <tibble [1 × 12]> <tibble [21 × 9]>
# 5 5 <tibble [3 × 5]> <tibble [1 × 12]> <tibble [21 × 9]>
# 6 6 <tibble [3 × 5]> <tibble [1 × 12]> <tibble [21 × 9]>
# 7 7 <tibble [3 × 5]> <tibble [1 × 12]> <tibble [21 × 9]>
# 8 8 <tibble [3 × 5]> <tibble [1 × 12]> <tibble [21 × 9]>
# 9 9 <tibble [3 × 5]> <tibble [1 × 12]> <tibble [21 × 9]>
# 10 10 <tibble [3 × 5]> <tibble [1 × 12]> <tibble [21 × 9]>
# # … with 14 more rows
Similar to their answer, to access the results, simply unnest whichever component is desired:
unnest(select(h.lm, hour, tidy_out))
# # A tibble: 72 x 6
# hour term estimate std.error statistic p.value
# <fct> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 1 (Intercept) 63.2 20.9 3.02 0.00728
# 2 1 wind -0.00237 0.0139 -0.171 0.866
# 3 1 temp -0.266 0.950 -0.280 0.783
# 4 2 (Intercept) 65.1 23.0 2.83 0.0111
# 5 2 wind 0.00691 0.0129 0.535 0.599
# 6 2 temp -0.448 0.877 -0.510 0.616
# 7 3 (Intercept) 65.2 17.8 3.67 0.00175
# 8 3 wind 0.00515 0.0112 0.458 0.652
# 9 3 temp -1.87 0.695 -2.69 0.0148
# 10 4 (Intercept) 49.7 17.6 2.83 0.0111
# # … with 62 more rows

I think you can use dplyr in a more proper way, where you don't need to define a function as in @fabians' answer.
results <- df.h %.%
  group_by(hour) %.%
  do(failwith(NULL, lm), formula = price ~ wind + temp)
or
results <- do(group_by(tbl_df(df.h), hour),
              failwith(NULL, lm), formula = price ~ wind + temp)
EDIT:
Of course, it also works without failwith:
results <- df.h %.%
  group_by(hour) %.%
  do(lm, formula = price ~ wind + temp)
results <- do(group_by(tbl_df(df.h), hour),
              lm, formula = price ~ wind + temp)
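Either way, results ends up holding one fitted model per hour (a list, in that era of dplyr). A hedged sketch of working with it, assuming that return type:
summary(results[[1]])                          # the first hour's fit
sapply(results, function(m) coef(m)[["temp"]]) # temp coefficient per hour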

As of dplyr 1.0.0, group_split gives a handy shortcut for this action:
library(dplyr)
library(broom)
library(purrr)
df.h <- data.frame(
  hour = factor(rep(1:24, each = 21)),
  price = runif(504, min = -10, max = 125),
  wind = runif(504, min = 0, max = 2500),
  temp = runif(504, min = -10, max = 25)
)
df.g <- group_split(df.h, hour)
map_dfr(df.g, function(x) tidy(lm(price ~ wind + temp, data = x)))
#> # A tibble: 72 x 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) -10.4 20.3 -0.512 0.615
#> 2 wind 0.0377 0.0117 3.23 0.00467
#> 3 temp 1.34 0.890 1.50 0.150
#> 4 (Intercept) 34.6 18.6 1.86 0.0799
#> 5 wind 0.0214 0.0125 1.71 0.104
#> 6 temp 0.332 0.865 0.384 0.706
#> 7 (Intercept) 42.5 15.3 2.79 0.0122
#> 8 wind 0.0103 0.0116 0.888 0.386
#> 9 temp -0.542 0.736 -0.736 0.471
#> 10 (Intercept) 64.1 18.8 3.41 0.00312
#> # … with 62 more rows
Created on 2021-03-04 by the reprex package (v1.0.0)
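Note that tidy() output carries no group identifier, so the hour labels are lost in the combined tibble above. A hedged variant that carries them along (using the .before argument from dplyr 1.0):
map_dfr(df.g, function(x) {
  tidy(lm(price ~ wind + temp, data = x)) %>%
    mutate(hour = x$hour[1], .before = 1)
})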

A few revisions of the tidyverse later, the do() operator is superseded, and we can fit one model per group with one line of code less.
library("broom")
library("tidyverse")
df.h <- data.frame(
  hour = factor(rep(1:24, each = 21)),
  price = runif(504, min = -10, max = 125),
  wind = runif(504, min = 0, max = 2500),
  temp = runif(504, min = -10, max = 25)
)
df.h %>%
  group_by(hour) %>%
  group_modify(
    # Use `tidy`, `glance` or `augment` to extract different information from the fitted models.
    ~ tidy(lm(price ~ wind + temp, data = .))
  )
#> # A tibble: 72 × 6
#> # Groups: hour [24]
#> hour term estimate std.error statistic p.value
#> <fct> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 1 (Intercept) 73.9 16.3 4.52 0.000266
#> 2 1 wind -0.0256 0.0119 -2.15 0.0456
#> 3 1 temp 1.72 0.861 2.00 0.0604
#> 4 2 (Intercept) 81.5 18.4 4.42 0.000331
#> 5 2 wind -0.0111 0.00973 -1.14 0.270
#> 6 2 temp -1.60 0.763 -2.09 0.0506
#> 7 3 (Intercept) 59.9 16.1 3.73 0.00154
#> 8 3 wind 0.00358 0.0102 0.349 0.731
#> 9 3 temp -1.82 0.664 -2.74 0.0134
#> 10 4 (Intercept) 49.6 18.5 2.69 0.0151
#> # … with 62 more rows
Created on 2022-04-20 by the reprex package (v2.0.1)
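The same pattern yields model-level fit statistics if you swap tidy for glance, as the comment in the code suggests; a minimal variation:
df.h %>%
  group_by(hour) %>%
  group_modify(~ glance(lm(price ~ wind + temp, data = .)))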

Related

Average a multiple number of rows for every column, multiple times

Here I have a snippet of my dataset. The rows indicate different days of the year.
The Substations represent individuals, there are over 500 individuals.
The 10 minute time periods run all the way through 24 hours.
I need to find an average value for each 10 minute interval for each individual in this dataset. This should result in a single row per substation, with the respective average value for each time interval.
I have tried:
meanbygroup <- stationgroup %>%
  group_by(Substation) %>%
  summarise(means = colMeans(tenminintervals[sapply(tenminintervals, is.numeric)]))
But this averages the entire column and I am left with the same average values for each individual substation.
So for each individual substation, I need an average for each individual time interval.
Please help!
Try using summarize(across()), like this:
df %>%
  group_by(Substation) %>%
  summarize(across(everything(), ~ mean(.x, na.rm = TRUE)))
Output:
Substation `00:00` `00:10` `00:20`
<chr> <dbl> <dbl> <dbl>
1 A -0.233 0.110 -0.106
2 B 0.203 -0.0997 -0.128
3 C -0.0733 0.196 -0.0205
4 D 0.0905 -0.0449 -0.0529
5 E 0.401 0.152 -0.0957
6 F 0.0368 0.120 -0.0787
7 G 0.0323 -0.0792 -0.278
8 H 0.132 -0.0766 0.157
9 I -0.0693 0.0578 0.0732
10 J 0.0776 -0.176 -0.0192
# … with 16 more rows
Input:
set.seed(123)
df = bind_cols(
  tibble(Substation = sample(LETTERS, size = 1000, replace = T)),
  as_tibble(setNames(lapply(1:3, function(x) rnorm(1000)), c("00:00", "00:10", "00:20")))
) %>% arrange(Substation)
# A tibble: 1,000 × 4
Substation `00:00` `00:10` `00:20`
<chr> <dbl> <dbl> <dbl>
1 A 0.121 -1.94 0.137
2 A -0.322 1.05 0.416
3 A -0.158 -1.40 0.192
4 A -1.85 1.69 -0.0922
5 A -1.16 -0.455 0.754
6 A 1.95 1.06 0.732
7 A -0.132 0.655 -1.84
8 A 1.08 -0.329 -0.130
9 A -1.21 2.82 -0.0571
10 A -1.04 0.237 -0.328
# … with 990 more rows
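If the real data has non-numeric columns besides Substation, the asker's sapply(..., is.numeric) filter translates to where(is.numeric) inside across(); a hedged variant of the answer above:
df %>%
  group_by(Substation) %>%
  summarize(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))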

Dividing a number within a month with the last observation in the previous month using dplyr

I am struggling to find the correct way of computing the relative return within a month, using the last observation in the previous month. Data for reference:
set.seed(123)
Date = seq(as.Date("2021/12/31"), by = "day", length.out = 90)
Returns = runif(90, min=-0.02, max = 0.02)
mData = data.frame(Date, Returns)
Then, I would like to have a return column. For example, the return for February 3rd should use the respective dates: 2022-02-03 / 2022-01-31 - 1, and likewise for, e.g., March 3rd: 2022-03-03 / 2022-02-28 - 1. So the question is: how can I keep the daily returns within a month as the numerator while using the last observation of the previous month as the denominator, using dplyr?
Use a tmp column to get the previous value from the last month (assuming sorted data) and then pick the first one per group. Grouping is done on year-month in group_by():
mData %>%
  mutate(tmp = lag(Returns)) %>%
  group_by(dat = strftime(Date, format = "%Y-%m")) %>%
  mutate(tmp = first(tmp), result = Returns / tmp - 1) %>%
  ungroup() %>%
  select(-c(tmp, dat))
# A tibble: 90 × 5 # before select:
Date Returns result # tmp dat
<date> <dbl> <dbl> # <dbl> <chr>
1 2021-12-31 -0.00850 NA # NA 2021-12
2 2022-01-01 0.0115 -2.36 # -0.00850 2022-01
3 2022-01-02 -0.00364 -0.571 # -0.00850 2022-01
4 2022-01-03 0.0153 -2.80 # -0.00850 2022-01
5 2022-01-04 0.0176 -3.07 # -0.00850 2022-01
6 2022-01-05 -0.0182 1.14 # -0.00850 2022-01
7 2022-01-06 0.00112 -1.13 # -0.00850 2022-01
8 2022-01-07 0.0157 -2.85 # -0.00850 2022-01
9 2022-01-08 0.00206 -1.24 # -0.00850 2022-01
10 2022-01-09 -0.00174 -0.796 # -0.00850 2022-01
# … with 80 more rows
library(tidyverse)
library(lubridate)
set.seed(123)
Date = seq(as.Date("2021/12/31"), by = "day", length.out = 90)
Returns = runif(90, min=-0.02, max = 0.02)
mData = data.frame(Date, Returns)
mData |>
  group_by(month(Date)) |>
  mutate(last_return = last(Returns)) |>
  ungroup() |>
  nest(data = c(Date, Returns)) |>
  mutate(last_return_lag = lag(last_return)) |>
  unnest(data) |>
  mutate(x = Returns/last_return_lag)
#> # A tibble: 90 × 6
#> `month(Date)` last_return Date Returns last_return_lag x
#> <dbl> <dbl> <date> <dbl> <dbl> <dbl>
#> 1 12 -0.00850 2021-12-31 -0.00850 NA NA
#> 2 1 0.0161 2022-01-01 0.0115 -0.00850 -1.36
#> 3 1 0.0161 2022-01-02 -0.00364 -0.00850 0.429
#> 4 1 0.0161 2022-01-03 0.0153 -0.00850 -1.80
#> 5 1 0.0161 2022-01-04 0.0176 -0.00850 -2.07
#> 6 1 0.0161 2022-01-05 -0.0182 -0.00850 2.14
#> 7 1 0.0161 2022-01-06 0.00112 -0.00850 -0.132
#> 8 1 0.0161 2022-01-07 0.0157 -0.00850 -1.85
#> 9 1 0.0161 2022-01-08 0.00206 -0.00850 -0.242
#> 10 1 0.0161 2022-01-09 -0.00174 -0.00850 0.204
#> # … with 80 more rows
Created on 2022-02-03 by the reprex package (v2.0.1)

Coalesce column values based on mapping to a vector of values

If I have the following two objects:
> set.seed(100)
> lookup <- sample(1:3, 20, replace=T)
> lookup
[1] 2 3 2 3 1 2 2 3 2 2 3 2 2 3 3 3 3 2 1 3
and
> tb <- tibble(A=runif(20,0,1), B=runif(20,0,1), C= runif(20,0,1))
> tb
# A tibble: 20 × 3
A B C
<dbl> <dbl> <dbl>
1 0.770 0.780 0.456
2 0.882 0.884 0.445
3 0.549 0.208 0.245
4 0.278 0.307 0.694
5 0.488 0.331 0.412
6 0.929 0.199 0.328
7 0.349 0.236 0.573
8 0.954 0.275 0.967
9 0.695 0.591 0.662
10 0.889 0.253 0.625
11 0.180 0.123 0.857
12 0.629 0.230 0.775
13 0.990 0.598 0.834
14 0.130 0.211 0.0915
15 0.331 0.464 0.460
16 0.865 0.647 0.599
17 0.778 0.961 0.920
18 0.827 0.676 0.983
19 0.603 0.445 0.0378
20 0.491 0.358 0.578
How do I use lookup to select the value of the corresponding row/column from tb?
i.e.
if the first element of lookup = 1 then I would like to select the value in A from the first row of tb
if the second element of lookup = 2 then I would like to select the value in B from the second row of tb
So I should end up with a 1d vector that is the same size as lookup. It will look like this:
> new data
> [1] 0.780 0.445 0.208 0.694 0.488 ... 0.578
Thanks!
data.frame (but not tibble or data.table) supports indexing on a matrix, so with this data,
set.seed(42)
lookup <- sample(1:3, 20, replace=T)
lookup
# [1] 1 1 1 1 2 2 2 1 3 3 1 1 2 2 2 3 3 1 1 3
tb <- tibble(A=runif(20,0,1), B=runif(20,0,1), C= runif(20,0,1))
head(tb)
# # A tibble: 6 x 3
# A B C
# <dbl> <dbl> <dbl>
# 1 0.514 0.958 0.189
# 2 0.390 0.888 0.271
# 3 0.906 0.640 0.828
# 4 0.447 0.971 0.693
# 5 0.836 0.619 0.241
# 6 0.738 0.333 0.0430
We can do
as.data.frame(tb)[cbind(seq_along(lookup), lookup)]
# [1] 0.514211784 0.390203467 0.905738131 0.446969628 0.618838207 0.333427211 0.346748248 0.388108283 0.479398564
# [10] 0.197410342 0.832916080 0.007334147 0.171264330 0.261087964 0.514412935 0.581604003 0.157905208 0.037431033
# [19] 0.973539914 0.775823363
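If matrix indexing is unfamiliar: subsetting with a two-column matrix of (row, column) pairs picks one element per pair, which is exactly what cbind(seq_along(lookup), lookup) builds. A tiny base R illustration:
m <- matrix(1:9, nrow = 3)
m[cbind(1:3, c(2, 1, 3))] # picks elements (1,2), (2,1), (3,3)
# [1] 4 2 9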
A less-efficient method can be done without as.data.frame:
mapply(`[[`, list(tb), seq_along(lookup), lookup)
# [1] 0.514211784 0.390203467 0.905738131 0.446969628 0.618838207 0.333427211 0.346748248 0.388108283 0.479398564
# [10] 0.197410342 0.832916080 0.007334147 0.171264330 0.261087964 0.514412935 0.581604003 0.157905208 0.037431033
# [19] 0.973539914 0.775823363
## also works with `list(as.data.table(tb))`
Though it does take a big hit in performance (not a surprise):
bench::mark(
  sindri_baldur1 = unlist(tb, use.names = FALSE)[seq_along(lookup) + (lookup - 1L)*nrow(tb)],
  sindri_baldur2 = unlist(tb)[seq_along(lookup) + (lookup - 1L)*nrow(tb)],
  base = as.data.frame(tb)[cbind(seq_along(lookup), lookup)],
  mapply = mapply(`[[`, list(tb), seq_along(lookup), lookup),
  paulsmith2 = {
    tb %>%
      mutate(lookup = lookup) %>%
      rowwise %>%
      mutate(new = c_across(A:C)[lookup]) %>%
      pull(new)
  },
  check = FALSE)
# # A tibble: 5 x 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
# 1 sindri_baldur1 4.5us 5.3us 159430. 736B 15.9 9999 1 62.7ms <NULL> <Rprof~ <benc~ <tibb~
# 2 sindri_baldur2 13.2us 14.7us 56723. 1.44KB 0 10000 0 176.3ms <NULL> <Rprof~ <benc~ <tibb~
# 3 base 78.3us 91.6us 7334. 944B 8.59 3414 4 465.5ms <NULL> <Rprof~ <benc~ <tibb~
# 4 mapply 612.4us 779.45us 942. 720B 6.39 442 3 469.4ms <NULL> <Rprof~ <benc~ <tibb~
# 5 paulsmith2 4.37ms 5.85ms 147. 20.3KB 6.51 68 3 461.1ms <NULL> <Rprof~ <benc~ <tibb~
(I have to use check=FALSE to work with the names introduced in sindri_baldur2, otherwise all results are numerically identical.)
You could exploit the fact that unlist() stacks the columns, so element (i, j) lands at position i + (j - 1) * nrow(tb):
unlist(tb, use.names = FALSE)[seq_along(lookup) + (lookup - 1L)*nrow(tb)]
# [1] 0.78035851 0.44541398 0.20771390 0.69435071 0.48830599 0.19867907 0.23569430 0.96699908 0.59132105
# [10] 0.25339065 0.85665304 0.22990589 0.59757529 0.09151028 0.45952549 0.59939816 0.91972191 0.67639817
# [19] 0.60332436 0.57793740
You could also keep use.names = TRUE (the default) to keep track of the original locations:
unlist(tb)[seq_along(lookup) + (lookup - 1L)*nrow(tb)] |> head()
# B1 C2 B3 C4 A5 B6
# 0.7803585 0.4454140 0.2077139 0.6943507 0.4883060 0.1986791
A base R solution:
tb$lookup <- lookup
tb$new <- apply(tb, 1, function(x) x[x[4]])
new <- tb$new
new
#> [1] 0.78035851 0.44541398 0.20771390 0.69435071 0.48830599 0.19867907
#> [7] 0.23569430 0.96699908 0.59132105 0.25339065 0.85665304 0.22990589
#> [13] 0.59757529 0.09151028 0.45952549 0.59939816 0.91972191 0.67639817
#> [19] 0.60332436 0.57793740
Another possible solution, based on tidyverse:
library(tidyverse)
set.seed(100)
lookup <- sample(1:3, 20, replace=T)
tb <- tibble(A=runif(20,0,1), B=runif(20,0,1), C= runif(20,0,1))
tb %>%
  mutate(lookup = lookup) %>%
  rowwise %>%
  mutate(new = c_across(A:C)[lookup]) %>%
  pull(new)
#> [1] 0.78035851 0.44541398 0.20771390 0.69435071 0.48830599 0.19867907
#> [7] 0.23569430 0.96699908 0.59132105 0.25339065 0.85665304 0.22990589
#> [13] 0.59757529 0.09151028 0.45952549 0.59939816 0.91972191 0.67639817
#> [19] 0.60332436 0.57793740

map2 error "arguments imply differing number of rows"

I was trying to fit a step function to a data frame and determine how many cut points produce the lowest MSE, and I kept getting the same error message:
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 149, 1332
My codes and the dummy dataframe go as follows:
library(tidyverse)
library(rsample)
library(broom)
library(rcfss)
set.seed(666)
df <- tibble(egalit_scale = runif(1481, 1, 35), income06 = runif(1481, 1, 25))
training_df <- vfold_cv(df, 10)
mse_df <- function(splits, cc) {
  model <- glm(egalit_scale ~ cut(income06, cc),
               data = analysis(splits))
  model_mse <- augment(model, newdata = assessment(splits)) %>%
    mse(truth = egalit_scale, estimate = round(.fitted))
  model_mse$.estimate
}
tidyr::expand(training_df, id, cc = 2:15) %>%
  left_join(training_df) %>%
  mutate(mse = map2(splits, cc, mse_df))
The error happens at the map2 step. I tried running each of the 10 CV folds with a specific number of cut points, say 6. It turned out 9 out of the 10 folds worked with the function, but one didn't. Could anyone please help me with this?
The issue is coming from the
augment(model, newdata = assessment(splits))
because in the previous step
model <- glm(egalit_scale ~ cut(income06, cc),
data = analysis(splits))
the model is fit on the analysis set of the 'splits' while predictions are made on the assessment set, and these have different numbers of rows, e.g.
out <- tidyr::expand(training_df, id, cc = 2:15) %>%
  left_join(training_df)
tmp <- out$splits[[1]]
analysis(tmp)
# A tibble: 1,332 x 2
# egalit_scale income06
# <dbl> <dbl>
# 1 27.3 9.69
# 2 7.71 8.48
# 3 34.3 21.3
# 4 7.85 15.8
# 5 13.3 24.6
# 6 26.2 8.67
# 7 34.3 4.78
# 8 17.9 16.8
# 9 1.45 21.2
#10 9.84 15.7
# … with 1,322 more rows
assessment(tmp)
# A tibble: 149 x 2
# egalit_scale income06
# <dbl> <dbl>
# 1 28.6 14.8
# 2 17.8 2.47
# 3 5.03 24.3
# 4 31.5 5.79
# 5 18.4 18.0
# 6 4.05 8.06
# 7 2.28 8.16
# 8 28.6 16.8
# 9 21.1 7.03
#10 3.67 14.2
# … with 139 more rows
So, if we change the model statement to use assessment:
mse_df <- function(splits, cc) {
  model <- glm(egalit_scale ~ cut(income06, cc),
               data = assessment(splits))
  model_mse <- augment(model, newdata = assessment(splits)) %>%
    mse(truth = egalit_scale, estimate = round(.fitted))
  model_mse$.estimate
}
library(yardstick)
out1 <- tidyr::expand(training_df, id, cc = 2:15) %>%
  left_join(training_df) %>%
  mutate(mse = map2_dbl(splits, cc, mse_df))
out1
# A tibble: 140 x 4
# id cc splits mse
# <chr> <int> <named list> <dbl>
# 1 Fold01 2 <split [1.3K/149]> 94.9
# 2 Fold01 3 <split [1.3K/149]> 94.6
# 3 Fold01 4 <split [1.3K/149]> 93.8
# 4 Fold01 5 <split [1.3K/149]> 94.5
# 5 Fold01 6 <split [1.3K/149]> 94.0
# 6 Fold01 7 <split [1.3K/149]> 92.0
# 7 Fold01 8 <split [1.3K/149]> 88.9
# 8 Fold01 9 <split [1.3K/149]> 91.2
# 9 Fold01 10 <split [1.3K/149]> 92.8
#10 Fold01 11 <split [1.3K/149]> 86.0
# … with 130 more rows

regression by group and retain all the columns in R

I am doing a linear regression by group and want to extract the residuals of the regression:
library(dplyr)
library(broom)
set.seed(124)
dat <- data.frame(ID = sample(111:503, 18576, replace = T),
                  ID2 = sample(11:50, 18576, replace = T),
                  ID3 = sample(1:14, 18576, replace = T),
                  yearRef = sample(1998:2014, 18576, replace = T),
                  value = rnorm(18576))
resid <- dat %>%
  dplyr::group_by(ID3) %>%
  do(augment(lm(value ~ yearRef, data = .))) %>%
  ungroup()
How do I retain ID and ID2 as well in resid? At the moment, only ID3 is retained in the final data frame.
Use group_split(), then loop through each group with map_dfr(), using bind_cols() to bind ID, ID2 and the augment() output:
library(dplyr)
library(purrr)
dat %>%
  group_split(ID3) %>%
  map_dfr(~ bind_cols(select(.x, ID, ID2),
                      augment(lm(value ~ yearRef, data = .x))), .id = "ID3")
# A tibble: 18,576 x 12
ID3 ID ID2 value yearRef .fitted .se.fit .resid .hat .sigma .cooksd
<chr> <int> <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 196 16 -0.385 2009 -0.0406 0.0308 -0.344 1.00e-3 0.973 6.27e-5
2 1 372 47 -0.793 2012 -0.0676 0.0414 -0.726 1.81e-3 0.973 5.05e-4
3 1 470 15 -0.496 2011 -0.0586 0.0374 -0.438 1.48e-3 0.973 1.50e-4
4 1 242 40 -1.13 2010 -0.0496 0.0338 -1.08 1.21e-3 0.973 7.54e-4
5 1 471 34 1.28 2006 -0.0135 0.0262 1.29 7.26e-4 0.972 6.39e-4
6 1 434 35 -1.09 1998 0.0586 0.0496 -1.15 2.61e-3 0.973 1.82e-3
7 1 467 45 -0.0663 2011 -0.0586 0.0374 -0.00769 1.48e-3 0.973 4.64e-8
8 1 334 27 -1.37 2003 0.0135 0.0305 -1.38 9.86e-4 0.972 9.92e-4
9 1 186 25 -0.0195 2003 0.0135 0.0305 -0.0331 9.86e-4 0.973 5.71e-7
10 1 114 34 1.09 2014 -0.0857 0.0500 1.18 2.64e-3 0.973 1.94e-3
# ... with 18,566 more rows, and 1 more variable: .std.resid <dbl>
Taking the "many models" approach, you can nest the data on ID3 and use purrr::map to create a list-column of the broom::augment data frames. The data list-column has all the original columns aside from ID3; map into that and select just the ones you want. Here I'm assuming you want to keep any column that starts with "ID", but you can change this. Then unnest both the data and the augment data frames.
library(dplyr)
library(tidyr)
dat %>%
  group_by(ID3) %>%
  nest() %>%
  mutate(aug = purrr::map(data, ~ broom::augment(lm(value ~ yearRef, data = .))),
         data = purrr::map(data, select, starts_with("ID"))) %>%
  unnest(c(data, aug))
#> # A tibble: 18,576 x 12
#> # Groups: ID3 [14]
#> ID3 ID ID2 value yearRef .fitted .se.fit .resid .hat .sigma
#> <int> <int> <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 11 431 15 0.619 2002 0.0326 0.0346 0.586 1.21e-3 0.995
#> 2 11 500 21 -0.432 2000 0.0299 0.0424 -0.462 1.82e-3 0.995
#> 3 11 392 28 -0.246 1998 0.0273 0.0515 -0.273 2.67e-3 0.995
#> 4 11 292 40 -0.425 1998 0.0273 0.0515 -0.452 2.67e-3 0.995
#> 5 11 175 36 -0.258 1999 0.0286 0.0468 -0.287 2.22e-3 0.995
#> 6 11 419 23 3.13 2005 0.0365 0.0273 3.09 7.54e-4 0.992
#> 7 11 329 17 -0.0414 2007 0.0391 0.0274 -0.0806 7.57e-4 0.995
#> 8 11 284 23 -0.450 2006 0.0378 0.0268 -0.488 7.25e-4 0.995
#> 9 11 136 28 -0.129 2006 0.0378 0.0268 -0.167 7.25e-4 0.995
#> 10 11 118 17 -1.55 2013 0.0470 0.0470 -1.60 2.24e-3 0.995
#> # … with 18,566 more rows, and 2 more variables: .cooksd <dbl>,
#> # .std.resid <dbl>
