map2 error "arguments imply differing number of rows" - r

I was trying to fit a step function to a dataframe and determine how many cut points produce the lowest mse. And I kept having the same error message:
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 149, 1332
My codes and the dummy dataframe go as follows:
library(tidyverse)
library{rsample)
library(broom)
library(rcfss)
set.seed(666)
df <- tibble(egalit_scale = runif(1481, 1, 35), income06 = runif(1481, 1, 25))
training_df <- vfold_cv(df, 10)
mse_df <- function(splits, cc){
model <- glm(egalit_scale ~ cut(income06, cc),
data = analysis(splits))
model_mse <- augment(model, newdata = assessment(splits)) %>%
mse(truth = egalit_scale, estimate = round(.fitted))
model_mse$.estimate
}
tidyr::expand(training_df, id, cc = 2:15) %>%
left_join(training_df) %>%
mutate(mse = map2(splits, cc, mse_df))
The error happens at the step with map2. I tried running each of the 10 folds CV with a specific number of cut points, say 6. It turned out 9 out of the 10 folds worked with the function, but one didn't. Could anyone please help me with this?

The issue is coming from the
augment(model, newdata = assessment(splits))
because in the previous step
model <- glm(egalit_scale ~ cut(income06, cc),
data = analysis(splits))
we do the analysis on the 'splits' instead of the assessment and this results in getting different number of rows, e.g.
out <- tidyr::expand(training_df, id, cc = 2:15) %>%
left_join(training_df)
tmp <- out$splits[[1]]
analysis(tmp)
# A tibble: 1,332 x 2
# egalit_scale income06
# <dbl> <dbl>
# 1 27.3 9.69
# 2 7.71 8.48
# 3 34.3 21.3
# 4 7.85 15.8
# 5 13.3 24.6
# 6 26.2 8.67
# 7 34.3 4.78
# 8 17.9 16.8
# 9 1.45 21.2
#10 9.84 15.7
# … with 1,322 more rows
assessment(tmp)
# A tibble: 149 x 2
# egalit_scale income06
# <dbl> <dbl>
# 1 28.6 14.8
# 2 17.8 2.47
# 3 5.03 24.3
# 4 31.5 5.79
# 5 18.4 18.0
# 6 4.05 8.06
# 7 2.28 8.16
# 8 28.6 16.8
# 9 21.1 7.03
#10 3.67 14.2
# … with 139 more rows
So, if we change the model statement with assessment
mse_df <- function(splits, cc){
model <- glm(egalit_scale ~ cut(income06, cc),
data = assessment(splits))
model_mse <- augment(model, newdata = assessment(splits)) %>%
mse(truth = egalit_scale, estimate = round(.fitted))
model_mse$.estimate
}
library(yardstick)
out1 <- tidyr::expand(training_df, id, cc = 2:15) %>%
left_join(training_df) %>%
mutate(mse = map2_dbl(splits, cc, mse_df))
out1
# A tibble: 140 x 4
# id cc splits mse
# <chr> <int> <named list> <dbl>
# 1 Fold01 2 <split [1.3K/149]> 94.9
# 2 Fold01 3 <split [1.3K/149]> 94.6
# 3 Fold01 4 <split [1.3K/149]> 93.8
# 4 Fold01 5 <split [1.3K/149]> 94.5
# 5 Fold01 6 <split [1.3K/149]> 94.0
# 6 Fold01 7 <split [1.3K/149]> 92.0
# 7 Fold01 8 <split [1.3K/149]> 88.9
# 8 Fold01 9 <split [1.3K/149]> 91.2
# 9 Fold01 10 <split [1.3K/149]> 92.8
#10 Fold01 11 <split [1.3K/149]> 86.0
# … with 130 more rows

Related

Fitting a Different Linear Model to Each Player [duplicate]

I would like to fit a model for each hour(the factor variable) using dplyr, I'm getting an error, and i'm not quite sure what's wrong.
df.h <- data.frame(
hour = factor(rep(1:24, each = 21)),
price = runif(504, min = -10, max = 125),
wind = runif(504, min = 0, max = 2500),
temp = runif(504, min = - 10, max = 25)
)
df.h <- tbl_df(df.h)
df.h <- group_by(df.h, hour)
group_size(df.h) # checks out, 21 obs. for each factor variable
# different attempts:
reg.models <- do(df.h, formula = price ~ wind + temp)
reg.models <- do(df.h, .f = lm(price ~ wind + temp, data = df.h))
I've tried various variations, but I can't get it to work.
The easiest way to do this, circa May 2015 is to use broom. broom contains three functions that deal with complex returned objects from statistical operations by groups: tidy (which deals with coefficient vectors from statistical operations by groups), glance (which deals with summary statistics from statistical operations by groups), and augment (which deals with observation level results from statistical operations by groups).
Here is a demonstration of its use to extract the various results of linear regression by groups into tidy data_frames.
tidy:
library(dplyr)
library(broom)
df.h = data.frame(
hour = factor(rep(1:24, each = 21)),
price = runif(504, min = -10, max = 125),
wind = runif(504, min = 0, max = 2500),
temp = runif(504, min = - 10, max = 25)
)
dfHour = df.h %>% group_by(hour) %>%
do(fitHour = lm(price ~ wind + temp, data = .))
# get the coefficients by group in a tidy data_frame
dfHourCoef = tidy(dfHour, fitHour)
dfHourCoef
which gives,
Source: local data frame [72 x 6]
Groups: hour
hour term estimate std.error statistic p.value
1 1 (Intercept) 53.336069324 21.33190104 2.5002961 0.022294293
2 1 wind -0.008475175 0.01338668 -0.6331053 0.534626575
3 1 temp 1.180019541 0.79178607 1.4903262 0.153453756
4 2 (Intercept) 77.737788772 23.52048754 3.3051096 0.003936651
5 2 wind -0.008437212 0.01432521 -0.5889765 0.563196358
6 2 temp -0.731265113 1.00109489 -0.7304653 0.474506855
7 3 (Intercept) 38.292039924 17.55361626 2.1814331 0.042655670
8 3 wind 0.005422492 0.01407478 0.3852630 0.704557388
9 3 temp 0.426765270 0.83672863 0.5100402 0.616220435
10 4 (Intercept) 30.603119492 21.05059583 1.4537888 0.163219027
.. ... ... ... ... ... ...
augment:
# get the predictions by group in a tidy data_frame
dfHourPred = augment(dfHour, fitHour)
dfHourPred
which gives,
Source: local data frame [504 x 11]
Groups: hour
hour price wind temp .fitted .se.fit .resid .hat .sigma .cooksd .std.resid
1 1 83.8414055 67.3780 -6.199231 45.44982 22.42649 38.391590 0.27955950 42.24400 0.1470891067 1.0663820
2 1 0.3061628 2073.7540 15.134085 53.61916 14.10041 -53.312993 0.11051343 41.43590 0.0735584714 -1.3327207
3 1 80.3790032 520.5949 24.711938 78.08451 20.03558 2.294497 0.22312869 43.64059 0.0003606305 0.0613746
4 1 121.9023855 1618.0864 12.382588 54.23420 10.31293 67.668187 0.05911743 40.23212 0.0566557575 1.6447224
5 1 -0.4039594 1542.8150 -5.544927 33.71732 14.53349 -34.121278 0.11740628 42.74697 0.0325125137 -0.8562896
6 1 29.8269832 396.6951 6.134694 57.21307 16.04995 -27.386085 0.14318542 43.05124 0.0271028701 -0.6975290
7 1 -7.1865483 2009.9552 -5.657871 29.62495 16.93769 -36.811497 0.15946292 42.54487 0.0566686969 -0.9466312
8 1 -7.8548693 2447.7092 22.043029 58.60251 19.94686 -66.457379 0.22115706 39.63999 0.2983443034 -1.7753911
9 1 94.8736726 1525.3144 24.484066 69.30044 15.93352 25.573234 0.14111563 43.12898 0.0231796755 0.6505701
10 1 54.4643001 2473.2234 -7.656520 23.34022 21.83043 31.124076 0.26489650 42.74790 0.0879837510 0.8558507
.. ... ... ... ... ... ... ... ... ... ... ...
glance:
# get the summary statistics by group in a tidy data_frame
dfHourSumm = glance(dfHour, fitHour)
dfHourSumm
which gives,
Source: local data frame [24 x 12]
Groups: hour
hour r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual
1 1 0.12364561 0.02627290 42.41546 1.2698179 0.30487225 3 -106.8769 221.7538 225.9319 32383.29 18
2 2 0.03506944 -0.07214506 36.79189 0.3270961 0.72521125 3 -103.8900 215.7799 219.9580 24365.58 18
3 3 0.02805424 -0.07993974 39.33621 0.2597760 0.77406651 3 -105.2942 218.5884 222.7665 27852.07 18
4 4 0.17640603 0.08489559 41.37115 1.9277147 0.17434859 3 -106.3534 220.7068 224.8849 30808.30 18
5 5 0.12575453 0.02861615 42.27865 1.2945915 0.29833246 3 -106.8091 221.6181 225.7962 32174.72 18
6 6 0.08114417 -0.02095092 35.80062 0.7947901 0.46690268 3 -103.3164 214.6328 218.8109 23070.31 18
7 7 0.21339168 0.12599076 32.77309 2.4415266 0.11529934 3 -101.4609 210.9218 215.0999 19333.36 18
8 8 0.21655629 0.12950699 40.92788 2.4877430 0.11119114 3 -106.1272 220.2543 224.4324 30151.65 18
9 9 0.23388711 0.14876346 35.48431 2.7476160 0.09091487 3 -103.1300 214.2601 218.4381 22664.45 18
10 10 0.18326177 0.09251307 40.77241 2.0194425 0.16171339 3 -106.0472 220.0945 224.2726 29923.01 18
.. ... ... ... ... ... ... .. ... ... ... ... ...
As of mid 2020 (and updated to fit dplyr 1.0+ as of 2022-04), tchakravarty's answer will fail. In order to circumvent the new approach of broom and dpylr seem to interact, the following combination of broom::tidy, broom::augment and broom::glance can be used. We just have to use them in conjunvtion with nest_by() and summarize() (previously inside do() and later unnest() the tibble).
library(dplyr)
library(broom)
library(tidyr)
set.seed(42)
df.h = data.frame(
hour = factor(rep(1:24, each = 21)),
price = runif(504, min = -10, max = 125),
wind = runif(504, min = 0, max = 2500),
temp = runif(504, min = - 10, max = 25)
)
df.h %>%
nest_by(hour) %>%
mutate(mod = list(lm(price ~ wind + temp, data = data))) %>%
summarize(tidy(mod))
# # A tibble: 72 × 6
# # Groups: hour [24]
# hour term estimate std.error statistic p.value
# <fct> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 1 (Intercept) 87.4 15.8 5.55 0.0000289
# 2 1 wind -0.0129 0.0120 -1.08 0.296
# 3 1 temp 0.588 0.849 0.693 0.497
# 4 2 (Intercept) 92.3 21.6 4.27 0.000466
# 5 2 wind -0.0227 0.0134 -1.69 0.107
# 6 2 temp -0.216 0.841 -0.257 0.800
# 7 3 (Intercept) 61.1 18.6 3.29 0.00409
# 8 3 wind 0.00471 0.0128 0.367 0.718
# 9 3 temp 0.425 0.964 0.442 0.664
# 10 4 (Intercept) 31.6 15.3 2.07 0.0529
df.h %>%
nest_by(hour) %>%
mutate(mod = list(lm(price ~ wind + temp, data = data))) %>%
summarize(augment(mod))
# # A tibble: 504 × 10
# # Groups: hour [24]
# hour price wind temp .fitted .resid .hat .sigma .cooksd .std.resid
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 113. 288. -1.75 82.7 30.8 0.123 37.8 0.0359 0.877
# 2 1 117. 2234. 18.4 69.5 47.0 0.201 36.4 0.165 1.40
# 3 1 28.6 1438. 4.75 71.7 -43.1 0.0539 37.1 0.0265 -1.18
# 4 1 102. 366. 9.77 88.5 13.7 0.151 38.4 0.00926 0.395
# 5 1 76.6 2257. -4.69 55.6 21.0 0.245 38.2 0.0448 0.644
# 6 1 60.1 633. -3.18 77.4 -17.3 0.0876 38.4 0.00749 -0.484
# 7 1 89.4 376. -4.16 80.1 9.31 0.119 38.5 0.00314 0.264
# 8 1 8.18 1921. 19.2 74.0 -65.9 0.173 34.4 0.261 -1.93
# 9 1 78.7 575. -6.11 76.4 2.26 0.111 38.6 0.000170 0.0640
# 10 1 85.2 763. -0.618 77.2 7.94 0.0679 38.6 0.00117 0.219
# # … with 494 more rows
df.h %>%
nest_by(hour) %>%
mutate(mod = list(lm(price ~ wind + temp, data = data))) %>%
summarize(glance(mod))
# # A tibble: 24 × 13
# # Groups: hour [24]
# hour r.squared adj.r.squared sigma statistic p.value df logLik AIC
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 0.0679 -0.0357 37.5 0.655 0.531 2 -104. 217.
# 2 2 0.139 0.0431 42.7 1.45 0.261 2 -107. 222.
# 3 3 0.0142 -0.0953 43.1 0.130 0.879 2 -107. 222.
# 4 4 0.0737 -0.0293 36.7 0.716 0.502 2 -104. 216.
# 5 5 0.213 0.126 37.8 2.44 0.115 2 -104. 217.
# 6 6 0.0813 -0.0208 33.5 0.796 0.466 2 -102. 212.
# 7 7 0.0607 -0.0437 40.7 0.582 0.569 2 -106. 220.
# 8 8 0.153 0.0592 36.3 1.63 0.224 2 -104. 215.
# 9 9 0.166 0.0736 36.5 1.79 0.195 2 -104. 216.
# 10 10 0.110 0.0108 40.0 1.11 0.351 2 -106. 219.
# # … with 14 more rows, and 4 more variables: BIC <dbl>, deviance <dbl>,
# # df.residual <int>, nobs <int>
Credits to Bob Muenchen's Blog for inspiration on that.
In dplyr 0.4, you can do:
df.h %>% do(model = lm(price ~ wind + temp, data = .))
from the documentation for do:
.f: a function to apply to each piece. The first unnamed argument supplied to .f will be a data frame.
So:
reg.models <- do(df.h,
.f=function(data){
lm(price ~ wind + temp, data=data)
})
Probably useful to also save which hour the model was fitted for:
reg.models <- do(df.h,
.f=function(data){
m <- lm(price ~ wind + temp, data=data)
m$hour <- unique(data$hour)
m
})
I believe there's a more compact answer than loki's answer, which abandons the since replaced/superseded do():
library(dplyr)
library(broom)
library(tidyr)
h.lm <- df.h %>%
nest_by(hour) %>%
mutate(fitHour = list(lm(price ~ wind + temp, data = data))) %>%
summarise(tidy_out = list(tidy(fitHour)),
glance_out = list(glance(fitHour)),
augment_out = list(augment(fitHour))) %>%
ungroup()
h.lm
# # A tibble: 24 x 4
# hour tidy_out glance_out augment_out
# <fct> <list> <list> <list>
# 1 1 <tibble [3 × 5]> <tibble [1 × 12]> <tibble [21 × 9]>
# 2 2 <tibble [3 × 5]> <tibble [1 × 12]> <tibble [21 × 9]>
# 3 3 <tibble [3 × 5]> <tibble [1 × 12]> <tibble [21 × 9]>
# 4 4 <tibble [3 × 5]> <tibble [1 × 12]> <tibble [21 × 9]>
# 5 5 <tibble [3 × 5]> <tibble [1 × 12]> <tibble [21 × 9]>
# 6 6 <tibble [3 × 5]> <tibble [1 × 12]> <tibble [21 × 9]>
# 7 7 <tibble [3 × 5]> <tibble [1 × 12]> <tibble [21 × 9]>
# 8 8 <tibble [3 × 5]> <tibble [1 × 12]> <tibble [21 × 9]>
# 9 9 <tibble [3 × 5]> <tibble [1 × 12]> <tibble [21 × 9]>
# 10 10 <tibble [3 × 5]> <tibble [1 × 12]> <tibble [21 × 9]>
# # … with 14 more rows
similar to their answer, in order to access, simply unnest whichever component is desired:
unnest(select(h.lm, hour, tidy_out))
# # A tibble: 72 x 6
# hour term estimate std.error statistic p.value
# <fct> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 1 (Intercept) 63.2 20.9 3.02 0.00728
# 2 1 wind -0.00237 0.0139 -0.171 0.866
# 3 1 temp -0.266 0.950 -0.280 0.783
# 4 2 (Intercept) 65.1 23.0 2.83 0.0111
# 5 2 wind 0.00691 0.0129 0.535 0.599
# 6 2 temp -0.448 0.877 -0.510 0.616
# 7 3 (Intercept) 65.2 17.8 3.67 0.00175
# 8 3 wind 0.00515 0.0112 0.458 0.652
# 9 3 temp -1.87 0.695 -2.69 0.0148
# 10 4 (Intercept) 49.7 17.6 2.83 0.0111
# # … with 62 more rows
I think you can use dplyr in more proper way where you don't need to define function as in #fabians anwser.
results<-df.h %.%
group_by(hour) %.%
do(failwith(NULL, lm), formula = price ~ wind + temp)
or
results<-do(group_by(tbl_df(df.h), hour),
failwith(NULL, lm), formula = price ~ wind + temp)
EDIT:
Of course it also works without failwith
results<-df.h %.%
group_by(hour) %.%
do(lm, formula = price ~ wind + temp)
results<-do(group_by(tbl_df(df.h), hour),
lm, formula = price ~ wind + temp)
As of dplyr 1.0.0, group_split gives a handy shortcut for this action:
library(dplyr)
library(broom)
library(purrr)
df.h <- data.frame(
hour = factor(rep(1:24, each = 21)),
price = runif(504, min = -10, max = 125),
wind = runif(504, min = 0, max = 2500),
temp = runif(504, min = - 10, max = 25)
)
df.g <- group_split(df.h, hour)
map_dfr(df.g, function(x) tidy(lm(price ~ wind + temp, data=x)))
#> # A tibble: 72 x 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) -10.4 20.3 -0.512 0.615
#> 2 wind 0.0377 0.0117 3.23 0.00467
#> 3 temp 1.34 0.890 1.50 0.150
#> 4 (Intercept) 34.6 18.6 1.86 0.0799
#> 5 wind 0.0214 0.0125 1.71 0.104
#> 6 temp 0.332 0.865 0.384 0.706
#> 7 (Intercept) 42.5 15.3 2.79 0.0122
#> 8 wind 0.0103 0.0116 0.888 0.386
#> 9 temp -0.542 0.736 -0.736 0.471
#> 10 (Intercept) 64.1 18.8 3.41 0.00312
#> # … with 62 more rows
Created on 2021-03-04 by the reprex package (v1.0.0)
A few revisions of the tidyverse late the do() operator is superseded and we can fit one model per group with one line of code less.
library("broom")
library("tidyverse")
df.h <- data.frame(
hour = factor(rep(1:24, each = 21)),
price = runif(504, min = -10, max = 125),
wind = runif(504, min = 0, max = 2500),
temp = runif(504, min = -10, max = 25)
)
df.h %>%
group_by(hour) %>%
group_modify(
# Use `tidy`, `glance` or `augment` to extract different information from the fitted models.
~ tidy(lm(price ~ wind + temp, data = .))
)
#> # A tibble: 72 × 6
#> # Groups: hour [24]
#> hour term estimate std.error statistic p.value
#> <fct> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 1 (Intercept) 73.9 16.3 4.52 0.000266
#> 2 1 wind -0.0256 0.0119 -2.15 0.0456
#> 3 1 temp 1.72 0.861 2.00 0.0604
#> 4 2 (Intercept) 81.5 18.4 4.42 0.000331
#> 5 2 wind -0.0111 0.00973 -1.14 0.270
#> 6 2 temp -1.60 0.763 -2.09 0.0506
#> 7 3 (Intercept) 59.9 16.1 3.73 0.00154
#> 8 3 wind 0.00358 0.0102 0.349 0.731
#> 9 3 temp -1.82 0.664 -2.74 0.0134
#> 10 4 (Intercept) 49.6 18.5 2.69 0.0151
#> # … with 62 more rows
Created on 2022-04-20 by the reprex package (v2.0.1)

Looping linear regression output in a data frame in r

I have a dataset below in which I want to do linear regression for each country and state and then cbind the predicted values in the dataset:
Final data frame after adding three more columns:
I have done it for one country and one area but want to do it for each country and area and put the predicted, upper and lower limit values back in the data set by cbind:
data <- data.frame(country = c("US","US","US","US","US","US","US","US","US","US","UK","UK","UK","UK","UK"),
Area = c("G","G","G","G","G","I","I","I","I","I","A","A","A","A","A"),
week = c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5),amount = c(12,23,34,32,12,12,34,45,65,45,45,34,23,43,43))
data_1 <- data[(data$country=="US" & data$Area=="G"),]
model <- lm(amount ~ week, data = data_1)
pre <- predict(model,newdata = data_1,interval = "prediction",level = 0.95)
pre
How can I loop this for other combination of country and Area?
...and a Base R solution:
data <- data.frame(country = c("US","US","US","US","US","US","US","US","US","US","UK","UK","UK","UK","UK"),
Area = c("G","G","G","G","G","I","I","I","I","I","A","A","A","A","A"),
week = c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5),amount = c(12,23,34,32,12,12,34,45,65,45,45,34,23,43,43))
splitVar <- paste0(data$country,"-",data$Area)
dfList <- split(data,splitVar)
result <- do.call(rbind,lapply(dfList,function(x){
model <- lm(amount ~ week, data = x)
cbind(x,predict(model,newdata = x,interval = "prediction",level = 0.95))
}))
result
...the results:
country Area week amount fit lwr upr
UK-A.11 UK A 1 45 36.6 -6.0463638 79.24636
UK-A.12 UK A 2 34 37.1 -1.3409128 75.54091
UK-A.13 UK A 3 23 37.6 0.6671656 74.53283
UK-A.14 UK A 4 43 38.1 -0.3409128 76.54091
UK-A.15 UK A 5 43 38.6 -4.0463638 81.24636
US-G.1 US G 1 12 20.8 -27.6791493 69.27915
US-G.2 US G 2 23 21.7 -21.9985147 65.39851
US-G.3 US G 3 34 22.6 -19.3841749 64.58417
US-G.4 US G 4 32 23.5 -20.1985147 67.19851
US-G.5 US G 5 12 24.4 -24.0791493 72.87915
US-I.6 US I 1 12 20.8 -33.8985900 75.49859
US-I.7 US I 2 34 30.5 -18.8046427 79.80464
US-I.8 US I 3 45 40.2 -7.1703685 87.57037
US-I.9 US I 4 65 49.9 0.5953573 99.20464
US-I.10 US I 5 45 59.6 4.9014100 114.29859
We can also use function augment from package broom to get your desired information:
library(purrr)
library(broom)
data %>%
group_by(country, Area) %>%
nest() %>%
mutate(models = map(data, ~ lm(amount ~ week, data = .)),
aug = map(models, ~ augment(.x, interval = "prediction"))) %>%
unnest(aug) %>%
select(country, Area, amount, week, .fitted, .lower, .upper)
# A tibble: 15 x 7
# Groups: country, Area [3]
country Area amount week .fitted .lower .upper
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 US G 12 1 20.8 -27.7 69.3
2 US G 23 2 21.7 -22.0 65.4
3 US G 34 3 22.6 -19.4 64.6
4 US G 32 4 23.5 -20.2 67.2
5 US G 12 5 24.4 -24.1 72.9
6 US I 12 1 20.8 -33.9 75.5
7 US I 34 2 30.5 -18.8 79.8
8 US I 45 3 40.2 -7.17 87.6
9 US I 65 4 49.9 0.595 99.2
10 US I 45 5 59.6 4.90 114.
11 UK A 45 1 36.6 -6.05 79.2
12 UK A 34 2 37.1 -1.34 75.5
13 UK A 23 3 37.6 0.667 74.5
14 UK A 43 4 38.1 -0.341 76.5
15 UK A 43 5 38.6 -4.05 81.2
Here is a tidyverse way to do this for every combination of country and Area.
library(tidyverse)
data %>%
group_by(country, Area) %>%
nest() %>%
mutate(model = map(data, ~ lm(amount ~ week, data = .x)),
result = map2(model, data, ~data.frame(predict(.x, newdata = .y,
interval = "prediction",level = 0.95)))) %>%
ungroup %>%
select(-model) %>%
unnest(c(data, result))
# country Area week amount fit lwr upr
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 US G 1 12 20.8 -27.7 69.3
# 2 US G 2 23 21.7 -22.0 65.4
# 3 US G 3 34 22.6 -19.4 64.6
# 4 US G 4 32 23.5 -20.2 67.2
# 5 US G 5 12 24.4 -24.1 72.9
# 6 US I 1 12 20.8 -33.9 75.5
# 7 US I 2 34 30.5 -18.8 79.8
# 8 US I 3 45 40.2 -7.17 87.6
# 9 US I 4 65 49.9 0.595 99.2
#10 US I 5 45 59.6 4.90 114.
#11 UK A 1 45 36.6 -6.05 79.2
#12 UK A 2 34 37.1 -1.34 75.5
#13 UK A 3 23 37.6 0.667 74.5
#14 UK A 4 43 38.1 -0.341 76.5
#15 UK A 5 43 38.6 -4.05 81.2
And one more:
library(tidyverse)
data %>%
mutate(CountryArea=paste0(country,Area) %>% factor %>% fct_inorder) %>%
split(.$CountryArea) %>%
map(~lm(amount~week, data=.)) %>%
map(predict, interval = "prediction",level = 0.95) %>%
reduce(rbind) %>%
cbind(data, .)
country Area week amount fit lwr upr
1 US G 1 12 20.8 -27.6791493 69.27915
2 US G 2 23 21.7 -21.9985147 65.39851
3 US G 3 34 22.6 -19.3841749 64.58417
4 US G 4 32 23.5 -20.1985147 67.19851
5 US G 5 12 24.4 -24.0791493 72.87915
6 US I 1 12 20.8 -33.8985900 75.49859
7 US I 2 34 30.5 -18.8046427 79.80464
8 US I 3 45 40.2 -7.1703685 87.57037
9 US I 4 65 49.9 0.5953573 99.20464
10 US I 5 45 59.6 4.9014100 114.29859
11 UK A 1 45 36.6 -6.0463638 79.24636
12 UK A 2 34 37.1 -1.3409128 75.54091
13 UK A 3 23 37.6 0.6671656 74.53283
14 UK A 4 43 38.1 -0.3409128 76.54091
15 UK A 5 43 38.6 -4.0463638 81.24636

pmap columns by group to get weighted row-wise mean

I'm trying to get the weighted row mean of certain columns by group.
library(tidyverse)
set.seed(1)
df <- data.frame(group = rep(LETTERS[3:4], each = 10),
x = runif(n = 10, min = 10, max = 15),
y = runif(n = 10, min = 100, max = 150),
z = runif(n = 10, min = 100, max = 150))
df
# group x y z
# 1 C 11.32754 110.2987 146.7353
# 2 C 11.86062 108.8278 110.6071
# 3 C 12.86427 134.3511 132.5837
# 4 C 14.54104 119.2052 106.2778
# 5 C 11.00841 138.4921 113.3610
# 6 C 14.49195 124.8850 119.3057
# 7 C 14.72338 135.8809 100.6695
# 8 C 13.30399 149.5953 119.1194
# 9 C 13.14557 119.0018 143.4845
# 10 C 10.30893 138.8723 117.0174
# 11 D 11.32754 110.2987 146.7353
# 12 D 11.86062 108.8278 110.6071
# 13 D 12.86427 134.3511 132.5837
# 14 D 14.54104 119.2052 106.2778
# 15 D 11.00841 138.4921 113.3610
# 16 D 14.49195 124.8850 119.3057
# 17 D 14.72338 135.8809 100.6695
# 18 D 13.30399 149.5953 119.1194
# 19 D 13.14557 119.0018 143.4845
# 20 D 10.30893 138.8723 117.0174
To get crude row mean of x, y, z, I can do:
df %>%
mutate(rmean = pmap_dbl(list(x, y, z), ~mean(c(...))))
But I want to weight them by these weights
dfweight <- data.frame(group = c("C", "C", "C",
"D", "D", "D"),
cat = c("x", "y", "z",
"x", "y", "z"),
weights = c(.2, .7, .1,
.4, .1, .5))
# group cat weights
# 1 C x 0.2
# 2 C y 0.7
# 3 C z 0.1
# 4 D x 0.4
# 5 D y 0.1
# 6 D z 0.5
I thought I should extract the weights first:
dfweight_split <- lapply(split(dfweight, dfweight$group), function (x) x$weights)
dfweight_split
# $C
# [1] 0.2 0.7 0.1
# $D
# [1] 0.4 0.1 0.5
But I'm then unsure how to pmap/map over these?
df %>%
group_by(group) %>%
mutate(wmean = pmap_dbl(list(x, y, z), ~weight.mean(c(..., dfweight_split))))
#OR
df %>%
group_by(group) %>%
mutate(wmean = map2(list(x, y, z), dfweight_split, ~weight.mean(.x, .y)))
Happy to see base solutions too. A similar post is here.
thanks
If we want to use pmap, make sure that the 'dfweight' data columns are also in the same dataset. An option is to reshape to wide with pivot_wider, then do a join (right_join) and use pmap to loop over the rows, extract the column elements on the same order with the notation .. before the index, pass those as vector arguments in weighted.mean to create the column in mutate
library(dplyr)
library(purrr)
library(tidyr)
library(stringr)
dfweight %>%
pivot_wider(names_from = cat, values_from = weights) %>%
rename_at(-1, ~ str_c(., '_weight')) %>%
right_join(df) %>%
mutate(wmean = pmap_dbl(select(., -group),
~ weighted.mean(c(..4, ..5, ..6), c(..1, ..2, ..3)))) %>%
select(-ends_with('weight'))
# A tibble: 20 x 5
# group x y z wmean
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 C 11.3 110. 147. 94.1
# 2 C 11.9 109. 111. 89.6
# 3 C 12.9 134. 133. 110.
# 4 C 14.5 119. 106. 97.0
# 5 C 11.0 138. 113. 110.
# 6 C 14.5 125. 119. 102.
# 7 C 14.7 136. 101. 108.
# 8 C 13.3 150. 119. 119.
# 9 C 13.1 119. 143. 100.
#10 C 10.3 139. 117. 111.
#11 D 11.3 110. 147. 88.9
#12 D 11.9 109. 111. 70.9
#13 D 12.9 134. 133. 84.9
#14 D 14.5 119. 106. 70.9
#15 D 11.0 138. 113. 74.9
#16 D 14.5 125. 119. 77.9
#17 D 14.7 136. 101. 69.8
#18 D 13.3 150. 119. 79.8
#19 D 13.1 119. 143. 88.9
#20 D 10.3 139. 117. 76.5
I think it will be easier to perform this calculation if you reshape the data to long format. Then join the data with dfweight for each group and column name and find weighted mean for each row.
library(dplyr)
library(tidyr)
df %>%
mutate(row = row_number()) %>%
pivot_longer(cols = x:z, names_to = 'cat') %>%
left_join(dfweight, by = c('group', 'cat')) %>%
group_by(group, row) %>%
mutate(weight_mean = weighted.mean(value, weights)) %>%
ungroup %>%
select(-weights) %>%
pivot_wider(names_from = cat, values_from = value) %>%
select(-row)
# group weight_mean x y z
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 C 94.1 11.3 110. 147.
# 2 C 89.6 11.9 109. 111.
# 3 C 110. 12.9 134. 133.
# 4 C 97.0 14.5 119. 106.
# 5 C 110. 11.0 138. 113.
# 6 C 102. 14.5 125. 119.
# 7 C 108. 14.7 136. 101.
# 8 C 119. 13.3 150. 119.
# 9 C 100. 13.1 119. 143.
#10 C 111. 10.3 139. 117.
#11 D 88.9 11.3 110. 147.
#12 D 70.9 11.9 109. 111.
#13 D 84.9 12.9 134. 133.
#14 D 70.9 14.5 119. 106.
#15 D 74.9 11.0 138. 113.
#16 D 77.9 14.5 125. 119.
#17 D 69.8 14.7 136. 101.
#18 D 79.8 13.3 150. 119.
#19 D 88.9 13.1 119. 143.
#20 D 76.5 10.3 139. 117.
I got different random numbers with set.seed(1).

How to find mean value using multiple columns of a R data.frame?

I am trying to find mean of A and B for each row and save it as separate column but seems like the code only average the first row and fill the rest of the rows with that value. Any suggestion how to fix this?
library(tidyverse)
library(lubridate)
set.seed(123)
DF <- data.frame(Date = seq(as.Date("2001-01-01"), to = as.Date("2003-12-31"), by = "day"),
A = runif(1095, 1,60),
Z = runif(1095, 5,100)) %>%
mutate(MeanofAandZ= mean(A:Z))
Are you looking for this:
DF %>% rowwise() %>% mutate(MeanofAandZ = mean(c_across(A:Z)))
# A tibble: 1,095 x 4
# Rowwise:
Date A Z MeanofAandZ
<date> <dbl> <dbl> <dbl>
1 2001-01-01 26.5 7.68 17.1
2 2001-01-02 54.9 33.1 44.0
3 2001-01-03 37.1 82.0 59.5
4 2001-01-04 6.91 18.0 12.4
5 2001-01-05 53.0 8.76 30.9
6 2001-01-06 26.1 7.63 16.9
7 2001-01-07 59.3 30.8 45.0
8 2001-01-08 39.9 14.6 27.3
9 2001-01-09 59.2 93.6 76.4
10 2001-01-10 30.7 89.1 59.9
you can do it with Base R: rowMeans
Full Base R:
DF$MeanofAandZ <- rowMeans(DF[c("A", "Z")])
head(DF)
#> Date A Z MeanofAandZ
#> 1 2001-01-01 17.967074 76.92436 47.44572
#> 2 2001-01-02 47.510003 99.28325 73.39663
#> 3 2001-01-03 25.129638 64.33253 44.73109
#> 4 2001-01-04 53.098027 32.42556 42.76179
#> 5 2001-01-05 56.487570 23.99162 40.23959
#> 6 2001-01-06 3.687833 81.08720 42.38751
or inside a mutate:
library(dplyr)
DF <- DF %>% mutate(MeanofAandZ = rowMeans(cbind(A,Z)))
head(DF)
#> Date A Z MeanofAandZ
#> 1 2001-01-01 17.967074 76.92436 47.44572
#> 2 2001-01-02 47.510003 99.28325 73.39663
#> 3 2001-01-03 25.129638 64.33253 44.73109
#> 4 2001-01-04 53.098027 32.42556 42.76179
#> 5 2001-01-05 56.487570 23.99162 40.23959
#> 6 2001-01-06 3.687833 81.08720 42.38751
We can also do
DF$MeanofAandZ <- Reduce(`+`, DF[c("A", "Z")])/2
Or using apply
DF$MeanofAandZ <- apply(DF[c("A", "Z")], 1, mean)

Sort a dataframe according to characters in R [duplicate]

This question already has answers here:
R Sort strings according to substring
(2 answers)
Closed 2 years ago.
I got the dataframe (code) and I I want to sort it according to combName in a numerical order.
> code
# A tibble: 1,108 x 2
combName sumLength
<chr> <dbl>
1 20-1 8.05
2 20-10 14.7
3 20-100 21.2
4 20-101 17.6
5 20-102 25.4
6 20-103 46.3
7 20-104 68.7
8 20-105 24.3
9 20-106 46.3
10 20-107 14.0
# ... with 1,098 more rows
Afterwards the left column should look like:
> code
# A tibble: 1,108 x 2
combName sumLength
<chr> <dbl>
1 20-1 8.05
2 20-2 ...
3 20-3 ...
4 20-4 ...
5 20-5 ...
...
10 20-10 14.7
# ... with 1,098 more rows
It do not know what I can do to reach this format.
Does this work:
library(dplyr)
library(tidyr)
df
# A tibble: 10 x 2
combName sumLength
<chr> <dbl>
1 20-102 25.4
2 20-100 21.2
3 20-101 17.6
4 20-105 24.3
5 20-10 14.7
6 20-103 46.3
7 20-104 68.7
8 20-1 8.05
9 20-106 46.3
10 20-107 14
df %>% separate(combName, into = c('1','2'), sep = '-', remove = F) %>%
type.convert(as.is = T) %>% arrange(`1`,`2`) %>% select(-c(`1`,`2`))
# A tibble: 10 x 2
combName sumLength
<chr> <dbl>
1 20-1 8.05
2 20-10 14.7
3 20-100 21.2
4 20-101 17.6
5 20-102 25.4
6 20-103 46.3
7 20-104 68.7
8 20-105 24.3
9 20-106 46.3
10 20-107 14

Resources