R dplyr groupings

library(dplyr)
data(mtcars)
mtcars$FACTORA <- sample(c("A", "b"), nrow(mtcars), replace = TRUE)
mtcars$FACTORB <- sample(c("c", "e"), nrow(mtcars), replace = TRUE)
DATA = mtcars %>%
group_by(FACTORA, FACTORB) %>%
slice(which.min(wt)) &
group_by(FACTORA) %>%
slice(which.min(wt))
I want to keep the rows that minimize wt by qsec and gear, and also the rows that minimize wt by qsec alone, all in one data frame.
Or do I have to do this:
DATA = mtcars %>%
group_by(FACTORA,FACTORB) %>%
slice(which.min(wt))
DATADATA = mtcars %>%
group_by(FACTORA) %>%
slice(which.min(wt))
and then merge the two?

I think this is what you mean (I replaced qsec with cyl, which is categorical). Note that with this set of groupings keep2 is redundant: any row that minimizes wt within a cyl group also minimizes wt within its cyl/gear group, so it is already flagged by keep1.
Also, this returns only one minimum per group and drops ties, though since you use which.min above I figure that isn't important.
library(dplyr)
mtcars %>%
group_by(cyl, gear) %>%
arrange(wt) %>%
mutate(keep1 = row_number() == 1L) %>%
group_by(cyl) %>%
arrange(wt) %>%
mutate(keep2 = row_number() == 1L) %>%
filter(keep1 | keep2)
#> # A tibble: 8 × 13
#> # Groups: cyl [3]
#> mpg cyl disp hp drat wt qsec vs am gear carb keep1 keep2
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <lgl>
#> 1 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2 TRUE TRUE
#> 2 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2 TRUE FALSE
#> 3 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1 TRUE FALSE
#> 4 21 6 160 110 3.9 2.62 16.5 0 1 4 4 TRUE TRUE
#> 5 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6 TRUE FALSE
#> 6 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4 TRUE TRUE
#> 7 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 TRUE FALSE
#> 8 15.2 8 304 150 3.15 3.44 17.3 0 0 3 2 TRUE FALSE
Created on 2022-04-29 by the reprex package (v2.0.1)
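For comparison, the two-step approach from the question can also be combined without a merge. This is only a sketch, assuming cyl and gear stand in for the question's grouping columns as in the answer above; the names by_cyl_gear, by_cyl and combined are made up for illustration.

```r
library(dplyr)

# Lightest car within each cyl/gear combination
by_cyl_gear <- mtcars %>%
  group_by(cyl, gear) %>%
  slice(which.min(wt)) %>%
  ungroup()

# Lightest car within each cyl group
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  slice(which.min(wt)) %>%
  ungroup()

# Stack both results and drop rows that were selected twice
combined <- bind_rows(by_cyl_gear, by_cyl) %>%
  distinct()

nrow(combined)
```

With nested groupings like these, every row in by_cyl already appears in by_cyl_gear, so distinct() removes it again. With unrelated groupings the second result would contribute new rows.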

Related

Assign most common value of factor variable with summarize in R

R noob here, working in tidyverse / RStudio.
I have a categorical / factor variable that I'd like to retain in a group_by/summarize workflow. I'd like to summarize it using a summary function that returns the most common value of that factor within each group.
Is there a summary function I can use for this?
mean returns NA, median only works with numeric data, and summary gives me separate rows with counts of each factor level instead of the most common level.
Edit: example using subset of mtcars dataset:
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
21 6 160 110 3.9 2.62 16.5 0 1 4 4
21 6 160 110 3.9 2.88 17.0 0 1 4 4
22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
24.4 4 147. 62 3.69 3.19 20 1 0 4 2
22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
Here I have converted carb into a factor variable. In this subset of the data, you can see that among 6-cylinder cars there are 3 with carb=4 and 2 with carb=1; similarly among 4-cylinder cars there are 2 with carb=2 and 1 with carb=1.
So if I do:
data %>% group_by(cyl) %>% summarise(modalcarb = FUNC(carb))
where FUNC is the function I'm looking for, I should get:
cyl modalcarb
<dbl> <fct>
4 2
6 4
8 2 # there are multiple potential ways of handling multi-modal situations, but that's secondary here
Hope that makes sense!
You could use fmode from the collapse package to calculate the mode. Here is a reproducible example using the mtcars dataset, with cyl converted to a factor as the grouping variable and am as the column to summarise:
library(dplyr)
library(collapse)
mtcars %>%
mutate(cyl = as.factor(cyl)) %>%
group_by(cyl) %>%
summarise(mode = fmode(am))
#> # A tibble: 3 × 2
#> cyl mode
#> <fct> <dbl>
#> 1 4 1
#> 2 6 0
#> 3 8 0
Created on 2022-11-24 with reprex v2.0.2
We could use which.max after count:
library(dplyr)
# fake dataset
x <- mtcars %>%
mutate(cyl = factor(cyl)) %>%
select(cyl)
x %>%
count(cyl) %>%
slice(which.max(n))
cyl n
<fct> <int>
1 8 14
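You can use table to count and which.max to pick the most frequent value. which.max returns a position in the table, not in carb, so take its name rather than indexing carb with it:
library(tidyverse)
mtcars |>
group_by(cyl) |>
summarise(modalcarb = as.numeric(names(which.max(table(carb)))))
#> # A tibble: 3 x 2
#> cyl modalcarb
#> <dbl> <dbl>
#> 1 4 2
#> 2 6 4
#> 3 8 4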
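If you would rather not depend on an extra package, a small helper can serve as FUNC. The sketch below is one possible way to write it; modal_value is a hypothetical name, and it breaks ties by returning whichever value appears first in the group.

```r
library(dplyr)

# Hypothetical helper: most frequent value of x, first-seen value wins ties.
# Works for factors, characters and numbers, and keeps the input type.
modal_value <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

result <- mtcars %>%
  mutate(carb = factor(carb)) %>%
  group_by(cyl) %>%
  summarise(modalcarb = modal_value(carb))

result
```

Because the helper indexes into unique(x), a factor input comes back as a factor with its levels intact.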

dplyr - mutate using function that uses other column data as argument?

I have a list with 3 regression models, called logregs. My data has a column called type that only has integers 1, 2, and 3, which are used to decide which regression model from logregs should be used, and a column called adstock which is the only independent variable used in the regression models.
I'm trying to do something like:
dataframe %>% mutate(probability = predict(logregs[[type]], type = "prediction", newdata = adstock) )
Sample data frame:
structure(list(type = c(3L, 3L, 3L, 3L, 3L, 3L), adstock = c(1.7984,
1.7984, 2.7984, 6.7984, 6.5968, 4.992)), row.names = c(NA, 6L
), class = "data.frame")
(unfortunately, the logregs models are too large to dput here)
How is this achievable using dplyr?
Yes, but you need to take some more care on subsetting logregs, and use data.frame on your newdata=.
I'll generate a quick set of models based on mtcars.
library(dplyr)
library(tidyr)  # nest()
library(purrr)  # map()
library(tibble) # deframe()
models <- mtcars %>%
group_by(cyl = as.character(cyl)) %>%
nest() %>%
mutate(mdl = map(data, ~ lm(mpg ~ disp, data = .x))) %>%
arrange(cyl) %>%
select(cyl, mdl) %>%
deframe()
models
# $`4`
# Call:
# lm(formula = mpg ~ disp, data = .x)
# Coefficients:
# (Intercept) disp
# 40.8720 -0.1351
# $`6`
# Call:
# lm(formula = mpg ~ disp, data = .x)
# Coefficients:
# (Intercept) disp
# 19.081987 0.003605
# $`8`
# Call:
# lm(formula = mpg ~ disp, data = .x)
# Coefficients:
# (Intercept) disp
# 22.03280 -0.01963
Note that they are indexed on the character of the number of cylinders, since otherwise numeric indexing can be confusing.
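To illustrate why the character index matters, here is a toy list named like the models above (the string values are placeholders, not real models):

```r
# Placeholder list named by number of cylinders
toy_models <- list(`4` = "model for 4", `6` = "model for 6", `8` = "model for 8")

by_name <- toy_models[["6"]]     # looks up the element named "6"
by_position <- toy_models[[2]]   # second element, which happens to be "6"

# A numeric 6 is treated as a position, and the list only has 3 elements
numeric_lookup <- tryCatch(toy_models[[6]], error = function(e) "error")

by_name
by_position
numeric_lookup
```

Converting the key with as.character() before indexing avoids the positional interpretation.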
Let's perturb mtcars$disp a little so we can use it as new data:
set.seed(42)
mtcars %>%
mutate(disp = disp + sample(20, size=n(), replace = TRUE) - 10) %>%
group_by(cyl) %>%
sample_n(2)
# # A tibble: 6 x 11
# # Groups: cyl [3]
# mpg cyl disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
# 2 21.5 4 129. 97 3.7 2.46 20.0 1 0 3 1
# 3 21 6 169 110 3.9 2.62 16.5 0 1 4 4
# 4 19.2 6 173. 123 3.92 3.44 18.3 1 0 4 4
# 5 18.7 8 363 175 3.15 3.44 17.0 0 0 3 2
# 6 16.4 8 281. 180 3.07 4.07 17.4 0 0 3 3
The [[ indexing on your logregs expects a single type, but you're actually passing a vector. Since my data here is still grouped, I can go with the first of the group variable (cyl) and do just a single call to predict per group:
set.seed(42)
mtcars %>%
mutate(disp = disp + sample(20, size=n(), replace = TRUE) - 10) %>%
group_by(cyl) %>%
sample_n(2) %>%
mutate(mpg2 = predict(models[[as.character(cyl)[1]]], newdata = data.frame(disp)))
# # A tibble: 6 x 12
# # Groups: cyl [3]
# mpg cyl disp hp drat wt qsec vs am gear carb mpg2
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2 30.6
# 2 21.5 4 129. 97 3.7 2.46 20.0 1 0 3 1 23.4
# 3 21 6 169 110 3.9 2.62 16.5 0 1 4 4 19.7
# 4 19.2 6 173. 123 3.92 3.44 18.3 1 0 4 4 19.7
# 5 18.7 8 363 175 3.15 3.44 17.0 0 0 3 2 14.9
# 6 16.4 8 281. 180 3.07 4.07 17.4 0 0 3 3 16.5
If you don't want to (or cannot) group, then you need to run one prediction per row. This is expensive in that it runs predict with a single newdata= argument, but ... it still works. To do this, we'll map it:
library(purrr) # map* functions
set.seed(42)
mtcars %>%
mutate(disp = disp + sample(20, size=n(), replace = TRUE) - 10) %>%
group_by(cyl) %>%
sample_n(2) %>%
ungroup() %>%
mutate(mpg2 = map2_dbl(cyl, disp, ~ predict(models[[as.character(.x)]], newdata = data.frame(disp=.y))))
# # A tibble: 6 x 12
# mpg cyl disp hp drat wt qsec vs am gear carb mpg2
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2 30.6
# 2 21.5 4 129. 97 3.7 2.46 20.0 1 0 3 1 23.4
# 3 21 6 169 110 3.9 2.62 16.5 0 1 4 4 19.7
# 4 19.2 6 173. 123 3.92 3.44 18.3 1 0 4 4 19.7
# 5 18.7 8 363 175 3.15 3.44 17.0 0 0 3 2 14.9
# 6 16.4 8 281. 180 3.07 4.07 17.4 0 0 3 3 16.5
Note that I had to name the column in newdata = data.frame(disp = .y). Earlier, data.frame(disp) named the column after the input variable automatically. Here .y is not a name the model knows, so we have to name the column explicitly.
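Mapped back onto the question's setup, the grouped approach might look like the sketch below. Since the real logregs could not be shared, the training data and models here are simulated stand-ins, and type = "response" is an assumption about the kind of prediction wanted (it returns probabilities for a binomial glm).

```r
library(dplyr)

set.seed(1)
# Simulated training data (an assumption): three types, one predictor, binary outcome
train <- data.frame(
  type = rep(1:3, each = 50),
  adstock = runif(150, 0, 8)
)
train$y <- rbinom(150, 1, plogis(train$adstock - 4))

# One logistic regression per type; split() names the list "1", "2", "3"
logregs <- lapply(split(train, train$type), function(d) {
  glm(y ~ adstock, family = binomial, data = d)
})

new_obs <- data.frame(
  type = c(3L, 3L, 1L, 2L),
  adstock = c(1.7984, 2.7984, 6.5968, 4.992)
)

scored <- new_obs %>%
  group_by(type) %>%
  mutate(probability = predict(
    logregs[[as.character(type[1])]],         # one model per group
    newdata = data.frame(adstock = adstock),  # column name must match the formula
    type = "response"
  )) %>%
  ungroup()

scored
```

Grouping first means predict is called once per type rather than once per row.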

Select top rows in R using add_tally and top_n functions

I would like to select the top n rows in a data frame for which I
calculated a column n that represents the sum of a variable. For example,
using the mtcars data, I would like to filter to keep only the two cyl
with the greatest sum of mpg. In the following example, I was expecting
to select all rows where cyl == 4 and cyl == 8. It must be simple, but
I can not figure out my mistake.
library(tidyverse)
mtcars %>%
group_by(cyl) %>%
summarise(sum(mpg))
#> # A tibble: 3 x 2
#> cyl `sum(mpg)`
#> <dbl> <dbl>
#> 1 4 293.
#> 2 6 138.
#> 3 8 211.
mtcars %>%
group_by(cyl) %>% # Calculate the sum of mpg for each cyl
add_tally(mpg, sort = TRUE) %>%
ungroup() %>%
top_n(2, n)
#> # A tibble: 11 x 12
#> mpg cyl disp hp drat wt qsec vs am gear carb n
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 293.
#> 2 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 293.
#> 3 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 293.
#> 4 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1 293.
#> 5 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2 293.
#> 6 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1 293.
#> 7 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1 293.
#> 8 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1 293.
#> 9 26 4 120. 91 4.43 2.14 16.7 0 1 5 2 293.
#> 10 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2 293.
#> 11 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2 293.
Created on 2019-07-26 by the reprex package (v0.3.0)
top_n ranks rows rather than distinct values: it orders the data frame, returns the top n rows, and returns more than n rows if there are ties. Here n is constant within each cyl, so all 11 rows with cyl == 4 tie for first place and nothing else is returned.
From documentation -
Usage
top_n(x, n, wt)
Arguments
x: a tbl() to filter
n: number of rows to return. If x is grouped,
this is the number of rows per group. Will include more than n rows if
there are ties. If n is positive, selects the top n rows. If negative,
selects the bottom n rows.
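The tie behaviour is easier to see on a small vector. The sketch below compares two of dplyr's ranking helpers on made-up totals; top_n's handling of ties corresponds to the min_rank style, where tied values share a rank and the following ranks are skipped.

```r
library(dplyr)

# Made-up group totals, repeated as they would be on a per-row basis
totals <- c(293, 293, 211, 211, 138)

rank_min <- min_rank(desc(totals))     # ties share a rank, next ranks are skipped
rank_dense <- dense_rank(desc(totals)) # ties share a rank, no gaps

rank_min    # 1 1 3 3 5
rank_dense  # 1 1 2 2 3
```

Keeping ranks of 2 or less selects only the 293 rows under min_rank, but both the 293 and 211 rows under dense_rank.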
You need, as suggested by @tmfmnk, to rank the distinct values of n instead:
mtcars %>%
group_by(cyl) %>%
add_tally(mpg, sort = TRUE) %>%
ungroup() %>%
filter(dense_rank(desc(n)) < 3)
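Another way to get the same rows is to pick the top groups first and then filter the original data against them. This is a sketch assuming dplyr 1.0.0 or later for slice_max; top_cyl and total_mpg are illustrative names.

```r
library(dplyr)

# Step 1: one row per cyl, keep the two with the largest total mpg
top_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(total_mpg = sum(mpg)) %>%
  slice_max(total_mpg, n = 2)

# Step 2: keep only the cars whose cyl is among those groups
result <- mtcars %>%
  semi_join(top_cyl, by = "cyl")

sort(unique(result$cyl))
```

Working on the summarised table avoids the tie problem, because each cyl appears exactly once when the ranking is done.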

Group in a loop in the tidyverse

Can I group in a loop in the tidyverse?
The bigger task is to replace a grouping variable with NA if there are few observations in the group. I want to consolidate small groups into an NA group.
However, the code below won't let me group_by(x) where x is the looping variable.
library(tidyverse)
for (x in c("cyl", "gear")) {
mtcars %>%
add_count(x) %>%
mutate(x = ifelse(n() < 10, NA, x))
}
I receive the following error.
Error in grouped_df_impl(data, unname(vars), drop) :
Column `x` is unknown
Do you mean something like this?
library(dplyr)
for (x in c("cyl", "gear")) {
col <- sym(x)
mtcars <- mtcars %>%
add_count(!!col) %>%
mutate(!!col := ifelse(n < 10, NA, !!col)) %>%
select(-n)
}
mtcars
#> # A tibble: 32 x 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 NA 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 NA 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 NA 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 NA 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 19.2 NA 168. 123 3.92 3.44 18.3 1 0 4 4
#> # ... with 22 more rows
Created on 2018-12-08 by the reprex package (v0.2.1)
(Not the easiest syntax, I know....)
You could also use mutate_at with table
library(tidyverse)
mtcars %>%
mutate_at(vars(cyl, gear), ~ {
t <- table(.)
ifelse(. %in% names(t[t < 10]), NA, .)})
The function can be simplified to one line with purrr::keep
mtcars %>%
mutate_at(vars(cyl, gear),
~ ifelse(. %in% names(keep(table(.), `<`, 10)), NA, .))
Or if you happen to be working with a data.table, you can use an "update join" to subset to groups with low counts, then assign NA to that subset
library(data.table)
dt <- as.data.table(mtcars)
for(x in c('cyl', 'gear'))
dt[dt[, .N, x][N < 10], on = x, (x) := NA]
This will achieve the same result
all.equal(
dt,
mtcars %>%
mutate_at(vars(cyl, gear),
~ ifelse(. %in% names(keep(table(.), `<`, 10)), NA, .)) %>%
setDT
)
# [1] TRUE
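In dplyr 1.0.0 and later, mutate_at is superseded by across. A sketch of the same idea follows; lump_small and its min_n argument are hypothetical names introduced here.

```r
library(dplyr)

# Hypothetical helper: replace values that occur fewer than min_n times with NA
lump_small <- function(x, min_n = 10) {
  counts <- table(x)
  small <- names(counts)[counts < min_n]
  ifelse(as.character(x) %in% small, NA, x)
}

result <- mtcars %>%
  mutate(across(c(cyl, gear), lump_small))

# How many values were replaced in each column
colSums(is.na(result[c("cyl", "gear")]))
```

This removes the need for an explicit loop, since across applies the helper to each listed column in turn.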

R split apply combine with dplyr - how to keep NA resulting from slice

mtcars %>% select(mpg, cyl) %>% group_by(cyl) %>% arrange(mpg) %>% slice(8)
outputs
mpg cyl
<dbl> <dbl>
1 30.4 4
2 15.2 8
As you can see, it does not produce a row for 6 cylinders - what is the recommended way to keep all the groups, even if combine is empty?
To quickly select a row from each group, keeping NAs, you can subset inside summarise_all:
mtcars %>% group_by(cyl) %>%
arrange(mpg) %>%
summarise_all(funs(.[8]))
## # A tibble: 3 × 11
## cyl mpg disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4 30.4 75.7 52 4.93 1.615 18.52 1 1 4 2
## 2 6 NA NA NA NA NA NA NA NA NA NA
## 3 8 15.2 304.0 150 3.15 3.435 17.30 0 0 3 2
However, @Frank is right above; this won't extend nicely to selecting multiple rows per group, because summarise demands a single result row for each group. To select, say, rows 7 and 8 of each group, use a list column and expand it with tidyr::unnest:
library(tidyverse)
mtcars %>% group_by(cyl) %>%
arrange(mpg) %>%
summarise_all(funs(list(.[7:8]))) %>%
unnest()
## # A tibble: 6 × 11
## cyl mpg disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4 27.3 79.0 66 4.08 1.935 18.90 1 1 4 1
## 2 4 30.4 75.7 52 4.93 1.615 18.52 1 1 4 2
## 3 6 21.4 258.0 110 3.08 3.215 19.44 1 0 3 1
## 4 6 NA NA NA NA NA NA NA NA NA NA
## 5 8 15.2 275.8 180 3.07 3.780 18.00 0 0 3 3
## 6 8 15.2 304.0 150 3.15 3.435 17.30 0 0 3 2
A more concise version with purrr::dmap (since moved to the purrrlyr package) returns the same thing:
mtcars %>% group_by(cyl) %>%
arrange(mpg) %>%
dmap(~.x[7:8])
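Since dplyr 0.8 we can use group_map, so with the same idea as @alistaire we can do the following. (From dplyr 0.8.1 onward group_map returns a list; group_modify is the variant that returns a grouped tibble like the output shown here.)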
library(dplyr)
mtcars2 <- mtcars %>% select(mpg, cyl) %>% group_by(cyl) %>% arrange(mpg)
mtcars2 %>% group_map(~.[8,])
#> # A tibble: 3 x 2
#> # Groups: cyl [3]
#> cyl mpg
#> <dbl> <dbl>
#> 1 4 30.4
#> 2 6 NA
#> 3 8 15.2
mtcars2 %>% group_map(~.[7:8,])
#> # A tibble: 6 x 2
#> # Groups: cyl [3]
#> cyl mpg
#> <dbl> <dbl>
#> 1 4 27.3
#> 2 4 30.4
#> 3 6 21.4
#> 4 6 NA
#> 5 8 15.2
#> 6 8 15.2
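With current dplyr, where funs() is deprecated, the single-row case can be written with across. This is a sketch assuming dplyr 1.0.0 or later; eighth is an illustrative name.

```r
library(dplyr)

eighth <- mtcars %>%
  select(mpg, cyl) %>%
  arrange(mpg) %>%
  group_by(cyl) %>%
  # Indexing a vector past its end yields NA, so short groups are kept
  summarise(across(everything(), ~ .x[8]))

eighth
```

The 6-cylinder group has only 7 cars, so its mpg comes back as NA instead of the group being dropped.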

Resources