dplyr - mutate using function that uses other column data as argument? - r

I have a list with 3 regression models, called logregs. My data has a column called type that only has integers 1, 2, and 3, which are used to decide which regression model from logregs should be used, and a column called adstock which is the only independent variable used in the regression models.
I'm trying to do something like:
dataframe %>% mutate(probability = predict(logregs[[type]], type = "prediction", newdata = adstock) )
Sample data frame:
structure(list(type = c(3L, 3L, 3L, 3L, 3L, 3L), adstock = c(1.7984,
1.7984, 2.7984, 6.7984, 6.5968, 4.992)), row.names = c(NA, 6L
), class = "data.frame")
(unfortunately, the logregs models are too large to dput here)
How is this achievable using dplyr?

Yes, but you need to take more care when subsetting logregs, and you need to wrap the value you pass to newdata= in a data.frame.
I'll generate a quick set of models based on mtcars.
library(dplyr)
library(tidyr)  # nest()
library(purrr)  # map()
library(tibble) # deframe()
models <- mtcars %>%
group_by(cyl = as.character(cyl)) %>%
nest() %>%
mutate(mdl = map(data, ~ lm(mpg ~ disp, data = .x))) %>%
arrange(cyl) %>%
select(cyl, mdl) %>%
deframe()
models
# $`4`
# Call:
# lm(formula = mpg ~ disp, data = .x)
# Coefficients:
# (Intercept) disp
# 40.8720 -0.1351
# $`6`
# Call:
# lm(formula = mpg ~ disp, data = .x)
# Coefficients:
# (Intercept) disp
# 19.081987 0.003605
# $`8`
# Call:
# lm(formula = mpg ~ disp, data = .x)
# Coefficients:
# (Intercept) disp
# 22.03280 -0.01963
Note that the list is indexed by the number of cylinders as a character string. With a numeric index, models[[4]] would ask for the fourth element by position, not for the element named "4", which is confusing at best.
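To see why the names matter, here is a small sketch with a stand-in list (plain strings instead of fitted models; toy_models is a made-up name used purely for illustration). A numeric index is treated as a position, and only a character index is looked up by name.

```r
# Stand-in list: strings instead of fitted models, named like the list above.
toy_models <- list(`4` = "fit for 4 cyl", `6` = "fit for 6 cyl", `8` = "fit for 8 cyl")

cyl <- 6

# Character index: looked up by name, returns the element named "6".
by_name <- toy_models[[as.character(cyl)]]

# Numeric index: looked up by position. The list has only 3 elements,
# so position 6 does not exist and [[ raises an error.
by_position <- tryCatch(toy_models[[cyl]], error = function(e) "error")

# A small numeric value silently returns an element chosen by position:
second <- toy_models[[2]]  # the element named "6", because it is second
```

In the asker's case the types are 1, 2 and 3, so positional indexing happens to work as long as the list is stored in that order; naming the elements removes that dependency.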
Let's perturb mtcars$disp a little so that we can use it as new data:
set.seed(42)
mtcars %>%
mutate(disp = disp + sample(20, size=n(), replace = TRUE) - 10) %>%
group_by(cyl) %>%
sample_n(2)
# # A tibble: 6 x 11
# # Groups: cyl [3]
# mpg cyl disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
# 2 21.5 4 129. 97 3.7 2.46 20.0 1 0 3 1
# 3 21 6 169 110 3.9 2.62 16.5 0 1 4 4
# 4 19.2 6 173. 123 3.92 3.44 18.3 1 0 4 4
# 5 18.7 8 363 175 3.15 3.44 17.0 0 0 3 2
# 6 16.4 8 281. 180 3.07 4.07 17.4 0 0 3 3
The [[ indexing on your logregs expects a single type, but you're actually passing a vector. Since my data here is still grouped, I can go with the first of the group variable (cyl) and do just a single call to predict per group:
set.seed(42)
mtcars %>%
mutate(disp = disp + sample(20, size=n(), replace = TRUE) - 10) %>%
group_by(cyl) %>%
sample_n(2) %>%
mutate(mpg2 = predict(models[[as.character(cyl)[1]]], newdata = data.frame(disp)))
# # A tibble: 6 x 12
# # Groups: cyl [3]
# mpg cyl disp hp drat wt qsec vs am gear carb mpg2
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2 30.6
# 2 21.5 4 129. 97 3.7 2.46 20.0 1 0 3 1 23.4
# 3 21 6 169 110 3.9 2.62 16.5 0 1 4 4 19.7
# 4 19.2 6 173. 123 3.92 3.44 18.3 1 0 4 4 19.7
# 5 18.7 8 363 175 3.15 3.44 17.0 0 0 3 2 14.9
# 6 16.4 8 281. 180 3.07 4.07 17.4 0 0 3 3 16.5
If you don't want to (or cannot) group, then you need to run one prediction per row. This is expensive, because predict is called once per row with a single-row newdata= argument, but it still works. To do this, we'll map over the rows:
library(purrr) # map* functions
set.seed(42)
mtcars %>%
mutate(disp = disp + sample(20, size=n(), replace = TRUE) - 10) %>%
group_by(cyl) %>%
sample_n(2) %>%
ungroup() %>%
mutate(mpg2 = map2_dbl(cyl, disp, ~ predict(models[[as.character(.x)]], newdata = data.frame(disp=.y))))
# # A tibble: 6 x 12
# mpg cyl disp hp drat wt qsec vs am gear carb mpg2
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2 30.6
# 2 21.5 4 129. 97 3.7 2.46 20.0 1 0 3 1 23.4
# 3 21 6 169 110 3.9 2.62 16.5 0 1 4 4 19.7
# 4 19.2 6 173. 123 3.92 3.44 18.3 1 0 4 4 19.7
# 5 18.7 8 363 175 3.15 3.44 17.0 0 0 3 2 14.9
# 6 16.4 8 281. 180 3.07 4.07 17.4 0 0 3 3 16.5
Note that I had to name the column in newdata=data.frame(disp=.y). When we did it before, data.frame(disp) named the column after the input variable. Here the variable is called .y, a name the model does not know, so we have to name the column explicitly.
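Translated back to the question's setup, the grouped approach might look like the sketch below. The training data and the three glm fits are invented stand-ins, since the real logregs could not be shared, and the sample frame is extended to cover all three types. One thing to check against your own models: for a binomial glm, predict takes type = "response" to return probabilities.

```r
library(dplyr)

# Hypothetical stand-in for the asker's models: one logistic fit per type.
set.seed(1)
train <- data.frame(
  type    = rep(1:3, each = 50),
  adstock = runif(150, 0, 8)
)
train$y <- rbinom(150, 1, plogis(-2 + 0.5 * train$adstock))

# split() names the pieces "1", "2", "3", so the list can be indexed by name.
logregs <- lapply(split(train, train$type),
                  function(d) glm(y ~ adstock, family = binomial, data = d))

dataframe <- data.frame(
  type    = c(3L, 3L, 1L, 2L, 3L, 1L),
  adstock = c(1.7984, 1.7984, 2.7984, 6.7984, 6.5968, 4.992)
)

result <- dataframe %>%
  group_by(type) %>%
  mutate(probability = predict(
    logregs[[as.character(type[1])]],          # one model per group
    newdata = data.frame(adstock = adstock),   # column named as in the model
    type = "response"                          # probabilities for a binomial glm
  )) %>%
  ungroup()
```

Each group triggers a single predict call, so the cost scales with the number of types rather than the number of rows.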

Related

Optimize random with filter and map function

I want to randomly retrieve a list of cars mpg based on some predefined fuel type.
Here is the code that works but slows down the processing.
Is there a better way to apply this principle in a data volume containing a million rows?
list_carbs <- c(1,3,4,4)
get_sample_cars <- function (list_carbs){
filtered_cars <- map(list_carbs, ~mtcars %>% filter(carb ==.x))
res <- map(filtered_cars, ~sample(.x$mpg, size=1))
}
mpg_cars <- get_sample_cars(list_carbs)
Here are two examples of the expected results:
mpg carb
27.3 1
16.4 3
19.2 4
10.4 4
mpg carb
32.4 1
17.3 3
19.2 4
14.7 4
filter(mtcars, carb %in% list_carbs) %>%
group_by(carb) %>%
slice_sample(n = 1)
# A tibble: 3 x 11
# Groups: carb [3]
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1
2 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3
3 13.3 8 350 245 3.73 3.84 15.4 0 0 3 4
EDIT:
mtcars %>%
select(carb, mpg) %>%
nest_by(carb) %>%
filter(carb %in% list_carbs) %>%
mutate(data = map2(data, table(list_carbs)[as.character(carb)],
~sample(.x,.y)))%>%
unnest(data)
# A tibble: 4 x 2
# Groups: carb [3]
carb data
<dbl> <dbl>
1 1 22.8
2 3 16.4
3 4 14.3
4 4 14.7
You can probably simplify your code by using just this:
mpg_cars <- sample(mtcars$mpg[mtcars$carb %in% list_carbs], size = 3)
That is to say, you can subset the column you want in any way you like and then sample from the values that remain. Note that this draws from the pooled values, so it does not guarantee one value per requested carb.
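If one value per entry of list_carbs is needed (duplicates included), a base R sketch along these lines may scale better than filtering once per entry: split mpg by carb a single time, then draw from the matching pool. The names mpg_by_carb and res are made up for this example.

```r
list_carbs <- c(1, 3, 4, 4)

# Split mpg by carb once, instead of filtering the data for every entry.
mpg_by_carb <- split(mtcars$mpg, mtcars$carb)

set.seed(123)
sampled <- vapply(as.character(list_carbs), function(k) {
  pool <- mpg_by_carb[[k]]
  # Index with sample.int(): sample(pool, 1) would misbehave when the pool
  # has a single value, because sample() then treats it as 1:pool.
  pool[sample.int(length(pool), 1)]
}, numeric(1))

res <- data.frame(mpg = unname(sampled), carb = list_carbs)
```

Each draw is independent, so the two entries for carb 4 can return the same car, as in the original map-based code.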

R DPLYR GROUPINGS

library(dplyr)
data(mtcars)
mtcars$FACTORA <- sample(c("A", "b"), nrow(mtcars), replace = TRUE)
mtcars$FACTORB <- sample(c("c", "e"), nrow(mtcars), replace = TRUE)
DATA = mtcars %>%
group_by(FACTORA, FACTORB) %>%
slice(which.min(wt)) &
group_by(FACTORA) %>%
slice(which.min(wt))
I wish to keep the rows that minimize wt by qsec and gear, and also the rows that minimize wt by qsec alone, all in one data frame.
Or do I have to do this
DATA = mtcars %>%
group_by(FACTORA,FACTORB) %>%
slice(which.min(wt))
DATADATA = mtcars %>%
group_by(FACTORA) %>%
slice(which.min(wt))
and then do merge?
I think this is what you mean (replacing qsec with cyl, which is categorical). Note that with this set of groupings keep2 is redundant, since any row that minimizes wt within a cyl group also minimizes wt within its own cyl/gear group, so it is already flagged by keep1.
Also, this will only return one minimum and drop ties, though since you use which.min above I figure that isn't important.
library(dplyr)
mtcars %>%
group_by(cyl, gear) %>%
arrange(wt) %>%
mutate(keep1 = row_number() == 1L) %>%
group_by(cyl) %>%
arrange(wt) %>%
mutate(keep2 = row_number() == 1L) %>%
filter(keep1 | keep2)
#> # A tibble: 8 × 13
#> # Groups: cyl [3]
#> mpg cyl disp hp drat wt qsec vs am gear carb keep1 keep2
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <lgl>
#> 1 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2 TRUE TRUE
#> 2 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2 TRUE FALSE
#> 3 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1 TRUE FALSE
#> 4 21 6 160 110 3.9 2.62 16.5 0 1 4 4 TRUE TRUE
#> 5 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6 TRUE FALSE
#> 6 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4 TRUE TRUE
#> 7 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 TRUE FALSE
#> 8 15.2 8 304 150 3.15 3.44 17.3 0 0 3 2 TRUE FALSE
Created on 2022-04-29 by the reprex package (v2.0.1)
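The "two results, then combine" route from the question also works and needs no merge. A minimal sketch, assuming dplyr 1.0.0 or later for slice_min(); the object names are made up for this example.

```r
library(dplyr)

# Lightest car within each cyl/gear combination.
by_cyl_gear <- mtcars %>%
  group_by(cyl, gear) %>%
  slice_min(wt, n = 1, with_ties = FALSE) %>%
  ungroup()

# Lightest car within each cyl group.
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  slice_min(wt, n = 1, with_ties = FALSE) %>%
  ungroup()

# Stack both results and drop rows that appear in both.
combined <- bind_rows(by_cyl_gear, by_cyl) %>%
  distinct()
```

bind_rows() stacks the two results, so no join keys are needed; distinct() removes the rows selected by both groupings.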

How do you repeat on filtering datasets and then running regressions without writing out individual code?

I want to run a linear regression on the mtcars data where the data are all of mtcars, the IV is mtcars$am, and the DV is mtcars$mpg. I then want to use the grouping variable mtcars$gear to create 3 datasets where mtcars$gear is 3, 4, or 5, and then run the regression again on each of these 3 datasets separately.
The long process that I currently used is below.
Unique values of variables of interest:
## variables of interest
unique(mtcars$mpg)
# ---- NOTE: DV is mpg
unique(mtcars$am)
# ---- NOTE: IV is am
unique(mtcars$gear)
# ---- NOTE: grouping variable is gear
Here is the baseline code I used for the regression:
## linear regression with all data
lm__am_on_mpg__mtcars <- lm(mpg ~ am, data=mtcars)
summary(lm__am_on_mpg__mtcars)
I then used the filter() command in the tidyverse package to create 3 datasets, where mtcars$gear is 3, 4, or 5
### list of filtered datasets
str(mtcars__gear_is_3)
str(mtcars__gear_is_4)
str(mtcars__gear_is_5)
I then created 3 regressions with the same basic structure as the base regression above, but with different datasets connected with different mtcars$gear levels.
#### when mtcars__gear_is_3 is dataset used
lm__am_on_mpg__mtcars__gear_is_3 <- lm(mpg ~ am, data=mtcars__gear_is_3)
summary(lm__am_on_mpg__mtcars__gear_is_3)
#### when mtcars__gear_is_4 is dataset used
lm__am_on_mpg__mtcars__gear_is_4 <- lm(mpg ~ am, data=mtcars__gear_is_4)
summary(lm__am_on_mpg__mtcars__gear_is_4)
#### when mtcars__gear_is_5 is dataset used
lm__am_on_mpg__mtcars__gear_is_5 <- lm(mpg ~ am, data=mtcars__gear_is_5)
summary(lm__am_on_mpg__mtcars__gear_is_5)
This seems to work, but it also seems to be a lot of code. I feel this could be accomplished with more concise code. I want to know if I can speed this process up by writing code that:
(A) creates different datasets in a shorter way using the tidyverse filter method
(B) creates different regressions in a shorter way that just swaps the dataset names when appropriate
without having to write all of the code the long way.
Here are my questions:
(1) Is this possible to do in R in general?
(2) Is this possible with datasets?
(2.1) If so, how?
(3) Is this possible with regressions?
(3.1) If so, how?
====================
Here is my R code that I used to complete this task the long way
# How do you repeat on filtering datasets and then running regressions in R without writing out individual code?
## dataset of interest
mtcars
### info about dataset
head(mtcars)
str(mtcars)
colnames(mtcars)
## variables of interest
unique(mtcars$mpg)
# ---- NOTE: DV is mpg
unique(mtcars$am)
# ---- NOTE: IV is am
unique(mtcars$gear)
# ---- NOTE: grouping variable is gear
## linear regression with all data
lm__am_on_mpg__mtcars <- lm(mpg ~ am, data=mtcars)
summary(lm__am_on_mpg__mtcars)
## filter data based on mtcars$gear
### loads tidyverse
library(tidyverse)
### when mtcars$gear == 3
#### creates filtered dataset
# ---- NOTE: starting dataset - mtcars
# ---- NOTE: ending dataset - mtcars__gear_is_3
# ---- NOTE: filter variable - gear
# ---- NOTE: filter variable value(s) - 3
##### starting dataset
str(mtcars)
##### unique values of starting dataset$filter
unique(mtcars$gear)
##### filters data into post-filter dataset
mtcars__gear_is_3 <- filter(mtcars, (gear == "3"))
##### turns post-filter dataset into data frame
mtcars__gear_is_3 <- data.frame(mtcars__gear_is_3)
##### post-filter dataset
str(mtcars__gear_is_3)
##### unique values of post-filter dataset$filter
unique(mtcars__gear_is_3$gear)
### when mtcars$gear == 4
#### creates filtered dataset
# ---- NOTE: starting dataset - mtcars
# ---- NOTE: ending dataset - mtcars__gear_is_4
# ---- NOTE: filter variable - gear
# ---- NOTE: filter variable value(s) - 4
##### starting dataset
str(mtcars)
##### unique values of starting dataset$filter
unique(mtcars$gear)
##### filters data into post-filter dataset
mtcars__gear_is_4 <- filter(mtcars, (gear == "4"))
##### turns post-filter dataset into data frame
mtcars__gear_is_4 <- data.frame(mtcars__gear_is_4)
##### post-filter dataset
str(mtcars__gear_is_4)
##### unique values of post-filter dataset$filter
unique(mtcars__gear_is_4$gear)
### when mtcars$gear == 5
#### creates filtered dataset
# ---- NOTE: starting dataset - mtcars
# ---- NOTE: ending dataset - mtcars__gear_is_5
# ---- NOTE: filter variable - gear
# ---- NOTE: filter variable value(s) - 5
##### starting dataset
str(mtcars)
##### unique values of starting dataset$filter
unique(mtcars$gear)
##### filters data into post-filter dataset
mtcars__gear_is_5 <- filter(mtcars, (gear == "5"))
##### turns post-filter dataset into data frame
mtcars__gear_is_5 <- data.frame(mtcars__gear_is_5)
##### post-filter dataset
str(mtcars__gear_is_5)
##### unique values of post-filter dataset$filter
unique(mtcars__gear_is_5$gear)
## regressions where data is filtered by gear
### list of filtered datasets
str(mtcars__gear_is_3)
str(mtcars__gear_is_4)
str(mtcars__gear_is_5)
#### when mtcars__gear_is_3 is dataset used
lm__am_on_mpg__mtcars__gear_is_3 <- lm(mpg ~ am, data=mtcars__gear_is_3)
summary(lm__am_on_mpg__mtcars__gear_is_3)
#### when mtcars__gear_is_4 is dataset used
lm__am_on_mpg__mtcars__gear_is_4 <- lm(mpg ~ am, data=mtcars__gear_is_4)
summary(lm__am_on_mpg__mtcars__gear_is_4)
#### when mtcars__gear_is_5 is dataset used
lm__am_on_mpg__mtcars__gear_is_5 <- lm(mpg ~ am, data=mtcars__gear_is_5)
summary(lm__am_on_mpg__mtcars__gear_is_5)
Maybe you will be able to achieve your goal with something like this:
library(data.table)
dt <- as.data.table(mtcars)
formulas <- paste0("lm(mpg ~ am, data = dt[gear == ", unique(dt[,gear]), "])" )
l <- lapply(formulas, function(x) eval(parse(text=x)))
and to see all models, just use:
l
or to see the summary of one of the models:
summary(l[[1]])
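The same result can be had without building code as strings. A base R sketch that splits the data by gear and fits one model per piece (the names models_by_gear and summaries are made up for this example):

```r
# One data frame per gear value; the list is named "3", "4", "5".
pieces <- split(mtcars, mtcars$gear)

# Fit the same formula to every piece.
models_by_gear <- lapply(pieces, function(d) lm(mpg ~ am, data = d))

# Summaries for all models at once, or for a single one by name.
summaries <- lapply(models_by_gear, summary)
coef(models_by_gear[["4"]])
```

Avoiding eval(parse(text = ...)) keeps the formula and the data as ordinary R objects, which tends to be easier to debug.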
I would use nest() from tidyr (loaded as part of the tidyverse). See more info here.
library(tidyverse, warn.conflicts = FALSE)
df <- mtcars %>% # Nest data by gear
group_by(gear) %>%
nest()
df$data
#> [[1]]
#> # A tibble: 12 x 10
#> mpg cyl disp hp drat wt qsec vs am carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 1
#> 4 24.4 4 147. 62 3.69 3.19 20 1 0 2
#> 5 22.8 4 141. 95 3.92 3.15 22.9 1 0 2
#> 6 19.2 6 168. 123 3.92 3.44 18.3 1 0 4
#> 7 17.8 6 168. 123 3.92 3.44 18.9 1 0 4
#> 8 32.4 4 78.7 66 4.08 2.2 19.5 1 1 1
#> 9 30.4 4 75.7 52 4.93 1.62 18.5 1 1 2
#> 10 33.9 4 71.1 65 4.22 1.84 19.9 1 1 1
#> 11 27.3 4 79 66 4.08 1.94 18.9 1 1 1
#> 12 21.4 4 121 109 4.11 2.78 18.6 1 1 2
#>
#> [[2]]
#> # A tibble: 15 x 10
#> mpg cyl disp hp drat wt qsec vs am carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21.4 6 258 110 3.08 3.22 19.4 1 0 1
#> 2 18.7 8 360 175 3.15 3.44 17.0 0 0 2
#> 3 18.1 6 225 105 2.76 3.46 20.2 1 0 1
#> 4 14.3 8 360 245 3.21 3.57 15.8 0 0 4
#> 5 16.4 8 276. 180 3.07 4.07 17.4 0 0 3
#> 6 17.3 8 276. 180 3.07 3.73 17.6 0 0 3
#> 7 15.2 8 276. 180 3.07 3.78 18 0 0 3
#> 8 10.4 8 472 205 2.93 5.25 18.0 0 0 4
#> 9 10.4 8 460 215 3 5.42 17.8 0 0 4
#> 10 14.7 8 440 230 3.23 5.34 17.4 0 0 4
#> 11 21.5 4 120. 97 3.7 2.46 20.0 1 0 1
#> 12 15.5 8 318 150 2.76 3.52 16.9 0 0 2
#> 13 15.2 8 304 150 3.15 3.44 17.3 0 0 2
#> 14 13.3 8 350 245 3.73 3.84 15.4 0 0 4
#> 15 19.2 8 400 175 3.08 3.84 17.0 0 0 2
#>
#> [[3]]
#> # A tibble: 5 x 10
#> mpg cyl disp hp drat wt qsec vs am carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 26 4 120. 91 4.43 2.14 16.7 0 1 2
#> 2 30.4 4 95.1 113 3.77 1.51 16.9 1 1 2
#> 3 15.8 8 351 264 4.22 3.17 14.5 0 1 4
#> 4 19.7 6 145 175 3.62 2.77 15.5 0 1 6
#> 5 15 8 301 335 3.54 3.57 14.6 0 1 8
mod <- function(x) {
lm(mpg ~ am,data = x) # Create model
}
map(df$data, mod)
#> [[1]]
#>
#> Call:
#> lm(formula = mpg ~ am, data = x)
#>
#> Coefficients:
#> (Intercept) am
#> 21.050 5.225
#>
#>
#> [[2]]
#>
#> Call:
#> lm(formula = mpg ~ am, data = x)
#>
#> Coefficients:
#> (Intercept) am
#> 16.11 NA
#>
#>
#> [[3]]
#>
#> Call:
#> lm(formula = mpg ~ am, data = x)
#>
#> Coefficients:
#> (Intercept) am
#> 21.38 NA
df <- df %>%
mutate(model = map(data,mod))
df[[3]]
#> [[1]]
#>
#> Call:
#> lm(formula = mpg ~ am, data = x)
#>
#> Coefficients:
#> (Intercept) am
#> 21.050 5.225
#>
#>
#> [[2]]
#>
#> Call:
#> lm(formula = mpg ~ am, data = x)
#>
#> Coefficients:
#> (Intercept) am
#> 16.11 NA
#>
#>
#> [[3]]
#>
#> Call:
#> lm(formula = mpg ~ am, data = x)
#>
#> Coefficients:
#> (Intercept) am
#> 21.38 NA
Created on 2021-01-15 by the reprex package (v0.3.0)
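A note on the NA coefficients for am in two of the three models: within those gear groups am takes a single value, so its effect cannot be estimated. A quick check, sketched with dplyr (am_levels is a made-up name):

```r
library(dplyr)

# Count how many distinct am values occur within each gear group.
am_levels <- mtcars %>%
  group_by(gear) %>%
  summarise(n_am_values = n_distinct(am), .groups = "drop")

am_levels
```

Only the gear 4 group contains both automatic and manual cars, which is why it is the only model with an estimate for am.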

R - Making predictions and confidence intervals with different models for each group of data

A very similar question was asked here, but I want to add columns for a confidence interval. Their example that works:
x <- mtcars %>%
group_by(gear) %>%
do(model = lm(mpg ~ hp + wt, data = .))
x
Source: local data frame [3 x 2]
Groups: <by row>
# A tibble: 3 x 2
gear model
* <dbl> <list>
1 3 <S3: lm>
2 4 <S3: lm>
3 5 <S3: lm>
mtcars %>%
group_by(gear) %>%
nest %>%
inner_join(x) %>%
mutate(preds = map2(model, data, predict)) %>%
unnest(data, preds)
This works, and produces an additional column for mtcars with predicted values made with a separate model for each grouping. Now what I'd like to do, is include confidence interval columns from predict()
mtcars %>%
group_by(gear) %>%
nest %>%
inner_join(x) %>%
mutate(preds = map2(model, data, predict, interval = "confidence")) %>%
unnest(data, preds)
This returns the error:
Error in vec_rbind(!!!x, .ptype = ptype) : Internal error in `vec_assign()`: `value` should have been recycled to fit `x`.
The error is triggered in unnest() in the final line. I think the issue is related to the output format of predict(), which is a 3-column dataframe (fit, upr, lwr). Any help would be appreciated!
With interval = "confidence", the output of predict is a matrix, not a data frame. Convert it to a data frame with as.data.frame and then unnest:
library(tidyverse)
mtcars %>%
group_by(gear) %>%
nest %>%
inner_join(x) %>%
mutate(preds = map2(model, data,
~as.data.frame(predict(.x, .y, interval = "confidence")))) %>%
unnest(cols = c(preds, data))
# gear mpg cyl disp hp drat wt qsec vs am carb model fit lwr upr
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <dbl> <dbl> <dbl>
# 1 4 21 6 160 110 3.9 2.62 16.5 0 1 4 <lm> 22.0 19.6 24.4
# 2 4 21 6 160 110 3.9 2.88 17.0 0 1 4 <lm> 21.2 19.2 23.2
# 3 4 22.8 4 108 93 3.85 2.32 18.6 1 1 1 <lm> 25.1 23.0 27.1
# 4 4 24.4 4 147. 62 3.69 3.19 20 1 0 2 <lm> 26.0 21.5 30.6
# 5 4 22.8 4 141. 95 3.92 3.15 22.9 1 0 2 <lm> 22.2 19.9 24.4
# 6 4 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 <lm> 17.8 15.1 20.5
# 7 4 17.8 6 168. 123 3.92 3.44 18.9 1 0 4 <lm> 17.8 15.1 20.5
# 8 4 32.4 4 78.7 66 4.08 2.2 19.5 1 1 1 <lm> 28.7 26.6 30.8
# 9 4 30.4 4 75.7 52 4.93 1.62 18.5 1 1 2 <lm> 32.3 29.3 35.3
#10 4 33.9 4 71.1 65 4.22 1.84 19.9 1 1 1 <lm> 30.0 27.5 32.5
# … with 22 more rows
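For comparison, here is a self-contained sketch that fits and predicts inside each group in one step, without the separate model table or the join. It assumes dplyr 0.8.1 or later for group_modify(); with_ci is a made-up name.

```r
library(dplyr)

with_ci <- mtcars %>%
  group_by(gear) %>%
  group_modify(~ {
    # .x holds the rows of one gear group (without the gear column).
    fit <- lm(mpg ~ hp + wt, data = .x)
    # predict() returns a matrix with columns fit, lwr, upr;
    # convert it so it can be bound to the data column-wise.
    ci <- as.data.frame(predict(fit, newdata = .x, interval = "confidence"))
    bind_cols(.x, ci)
  }) %>%
  ungroup()
```

The trade-off is that the fitted models are discarded; keep the nested version if the model objects are needed later.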

R split apply combine with dplyr - how to keep NA resulting from slice

mtcars %>% select(mpg, cyl) %>% group_by(cyl) %>% arrange(mpg) %>% slice(8)
outputs
mpg cyl
<dbl> <dbl>
1 30.4 4
2 15.2 8
As you can see, it does not produce a row for 6 cylinders - what is the recommended way to keep all the groups, even if combine is empty?
To quickly select a row from each group, keeping NAs, you can subset inside summarise_all:
mtcars %>% group_by(cyl) %>%
arrange(mpg) %>%
summarise_all(funs(.[8]))
## # A tibble: 3 × 11
## cyl mpg disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4 30.4 75.7 52 4.93 1.615 18.52 1 1 4 2
## 2 6 NA NA NA NA NA NA NA NA NA NA
## 3 8 15.2 304.0 150 3.15 3.435 17.30 0 0 3 2
However, @Frank is right above; it won't extend nicely to subsetting to multiple rows in this format because summarise demands a single result row for each group. To subset, say, rows 7 and 8 of each group, use a list column and unnest with tidyr::unnest:
library(tidyverse)
mtcars %>% group_by(cyl) %>%
arrange(mpg) %>%
summarise_all(funs(list(.[7:8]))) %>%
unnest()
## # A tibble: 6 × 11
## cyl mpg disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4 27.3 79.0 66 4.08 1.935 18.90 1 1 4 1
## 2 4 30.4 75.7 52 4.93 1.615 18.52 1 1 4 2
## 3 6 21.4 258.0 110 3.08 3.215 19.44 1 0 3 1
## 4 6 NA NA NA NA NA NA NA NA NA NA
## 5 8 15.2 275.8 180 3.07 3.780 18.00 0 0 3 3
## 6 8 15.2 304.0 150 3.15 3.435 17.30 0 0 3 2
A more concise version with purrr::dmap returns the same thing (dmap has since been moved out of purrr into the purrrlyr package):
mtcars %>% group_by(cyl) %>%
arrange(mpg) %>%
dmap(~.x[7:8])
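Since dplyr 0.8 we can use group_map, so with the same idea as @alistaire we can do the following (in dplyr 0.8.1 and later, group_map returns a list, and group_modify is the variant that returns a grouped tibble like the output below):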
library(dplyr)
mtcars2 <- mtcars %>% select(mpg, cyl) %>% group_by(cyl) %>% arrange(mpg)
mtcars2 %>% group_map(~.[8,])
#> # A tibble: 3 x 2
#> # Groups: cyl [3]
#> cyl mpg
#> <dbl> <dbl>
#> 1 4 30.4
#> 2 6 NA
#> 3 8 15.2
mtcars2 %>% group_map(~.[7:8,])
#> # A tibble: 6 x 2
#> # Groups: cyl [3]
#> cyl mpg
#> <dbl> <dbl>
#> 1 4 27.3
#> 2 4 30.4
#> 3 6 21.4
#> 4 6 NA
#> 5 8 15.2
#> 6 8 15.2
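With current dplyr, where funs() is deprecated, the single-row case can be sketched with across() instead. This assumes dplyr 1.0.0 or later; eighth is a made-up name.

```r
library(dplyr)

eighth <- mtcars %>%
  arrange(mpg) %>%
  group_by(cyl) %>%
  # Indexing a vector past its end yields NA, so groups with fewer
  # than 8 rows are kept and filled with NA.
  summarise(across(everything(), ~ .x[8]), .groups = "drop")

eighth[, c("cyl", "mpg")]
```

This relies on ordinary vector indexing inside summarise(), so every group produces exactly one row whether or not an eighth row exists.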
