How can I use arrange in dplyr to order groups? - r

I would like to group data and then arrange the table so that groups with the highest values are shown first. E.g., in the mtcars dataset, I would like to group the cars by number of cylinders and then arrange the table so that the groups with the highest mean mpg are shown first:
mtcars %>% group_by (cyl) %>% arrange (desc(mean (mpg)))
This produces an error:
Error: incorrect size (1) at position 1, expecting : 32
The reason I am asking is that filter(), when applied after group_by(), operates on the whole group rather than on individual rows.

A good way to do this is to turn the grouping variable into a factor and use reorder (or forcats::fct_reorder) to control the order of the levels. Then you can arrange by that column. (The grouping is implicit in the reorder functions.)
library(dplyr)

mtcars %>%
  mutate(
    cyl = reorder(factor(cyl), -mpg)
    # stats::reorder, built-in, uses mean by default
    # use -mpg to make it descending
  ) %>%
  arrange(cyl)
# alternatively
library(forcats)

mtcars %>%
  mutate(
    cyl = fct_reorder(factor(cyl), mpg, .fun = mean, .desc = TRUE)
    # forcats::fct_reorder uses median by default,
    # and takes a .desc argument to make it descending
  ) %>%
  arrange(cyl)
Changing the data like this is nice because the order you specify will be remembered and used by other functions (like ordering bars or facets in a ggplot).
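For instance, here is a minimal sketch of how that stored level order carries through to a plot (it assumes ggplot2 is installed; the reordering is the same one used above):
library(dplyr)
library(ggplot2)

mtcars %>%
  mutate(cyl = reorder(factor(cyl), -mpg)) %>%  # order levels by descending mean mpg
  ggplot(aes(x = cyl, y = mpg)) +
  geom_boxplot()
# the boxes appear as 4, 6, 8: the group with the highest mean mpg comes first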

Perhaps this? First, group by cyl, then fill a new column with mean(mpg), which you can then arrange by however you want, and finally remove the temporary mean(mpg) column.
mtcars %>%
  group_by(cyl) %>%
  mutate(mean_mpg = mean(mpg)) %>%
  arrange(desc(mean_mpg)) %>%
  select(-mean_mpg)
#> # A tibble: 32 x 11
#> # Groups: cyl [3]
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 2 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 3 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 4 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1
#> 5 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
#> 6 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1
#> 7 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1
#> 8 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1
#> 9 26 4 120. 91 4.43 2.14 16.7 0 1 5 2
#> 10 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2
#> # ... with 22 more rows
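As a quick sanity check of that ordering, you can compute the group means directly; a minimal sketch (the rounded means in the comment come from the standard mtcars data):
mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg)) %>%
  arrange(desc(mean_mpg))
# mean mpg is about 26.7 for 4 cylinders, 19.7 for 6, and 15.1 for 8,
# which is why the 4-cylinder rows come first above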

Related

Assign most common value of factor variable with summarize in R

R noob here, working in tidyverse / RStudio.
I have a categorical / factor variable that I'd like to retain in a group_by/summarize workflow. I'd like to summarize it using a summary function that returns the most common value of that factor within each group.
Is there a summary function I can use for this?
mean returns NA, median only works with numeric data, and summary gives me separate rows with counts of each factor level instead of the most common level.
Edit: example using subset of mtcars dataset:
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
21 6 160 110 3.9 2.62 16.5 0 1 4 4
21 6 160 110 3.9 2.88 17.0 0 1 4 4
22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
24.4 4 147. 62 3.69 3.19 20 1 0 4 2
22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
Here I have converted carb into a factor variable. In this subset of the data, you can see that among 6-cylinder cars there are 3 with carb=4 and 2 with carb=1; similarly, among 4-cylinder cars there are 2 with carb=2 and 1 with carb=1.
So if I do:
data %>% group_by(cyl) %>% summarise(modalcarb = FUNC(carb))
where FUNC is the function I'm looking for, I should get:
cyl carb
<dbl> <fct>
4 2
6 4
8 2 # there are multiple potential ways of handling multi-modal situations, but that's secondary here
Hope that makes sense!
You could use the fmode() function from the collapse package to calculate the mode. Here is a reproducible example using the mtcars dataset, where the cyl column is the factor variable to group on:
library(dplyr)
library(collapse)

mtcars %>%
  mutate(cyl = as.factor(cyl)) %>%
  group_by(cyl) %>%
  summarise(mode = fmode(am))
#> # A tibble: 3 × 2
#> cyl mode
#> <fct> <dbl>
#> 1 4 1
#> 2 6 0
#> 3 8 0
Created on 2022-11-24 with reprex v2.0.2
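Applied to the carb variable the question actually asks about, the same pattern looks like this (a sketch under the same packages; carb is converted to a factor as in the question):
mtcars %>%
  mutate(carb = as.factor(carb)) %>%
  group_by(cyl) %>%
  summarise(modalcarb = fmode(carb))
# on the full mtcars data this gives carb 2 for 4-cylinder cars
# and carb 4 for both 6- and 8-cylinder cars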
We could use which.max after count:
library(dplyr)

# fake dataset
x <- mtcars %>%
  mutate(cyl = factor(cyl)) %>%
  select(cyl)

x %>%
  count(cyl) %>%
  slice(which.max(n))
cyl n
<fct> <int>
1 8 14
You can use table() to count and which.max() to find the most frequent level. Note that which.max() gives a position in the table, so take the corresponding name rather than using it to index carb directly.
library(tidyverse)

mtcars |>
  group_by(cyl) |>
  summarise(modalcarb = as.numeric(names(which.max(table(carb)))))
#> # A tibble: 3 x 2
#>     cyl modalcarb
#>   <dbl>     <dbl>
#> 1     4         2
#> 2     6         4
#> 3     8         4

How to replace duplicate row values by appending indexes in R using dplyr? [duplicate]

This question already has answers here:
Numbering rows within groups in a data frame
(10 answers)
How can two strings be concatenated?
(12 answers)
Closed 1 year ago.
I would like to replace duplicate row values in a given column by appending an underscore and an index based on their order of occurrence. For example:
old_df_col new_df_col
object object_1
object object_2
object object_3
object object_4
Most other questions focus on deleting duplicate values or replacing them with NA, so I wasn't able to find an implementation using R and dplyr.
Here's what I've worked out so far:
# count duplicates
mtcars %>% group_by(carb) %>% summarize(n=n())
# filter duplicates
mtcars %>% group_by(carb) %>% filter(n()>1)
You can group by the target variable and use row_number() to create the sequence.
You may want to sort the data first (using arrange()) so that the sequence is meaningful for your data, but that is not strictly necessary.
library(dplyr)

mtcars %>%
  group_by(carb) %>%
  arrange(carb, cyl, mpg, hp) %>%
  mutate(
    carb_seq = paste("carb", carb, "seq", row_number(), sep = "_")
  )
# A tibble: 32 x 12
# Groups: carb [6]
mpg cyl disp hp drat wt qsec vs am gear carb carb_seq
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1 carb_1_seq_1
2 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 carb_1_seq_2
3 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1 carb_1_seq_3
4 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1 carb_1_seq_4
5 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1 carb_1_seq_5
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 carb_1_seq_6
7 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 carb_1_seq_7
8 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2 carb_2_seq_1
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 carb_2_seq_2
10 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 carb_2_seq_3
# … with 22 more rows
Created on 2021-07-11 by the reprex package (v2.0.0)
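To reproduce the exact old_df_col / new_df_col example from the question, the same idea can be sketched like this (the four-row data frame is made up to match the example):
library(dplyr)
library(tibble)

df <- tibble(old_df_col = rep("object", 4))

df %>%
  group_by(old_df_col) %>%
  mutate(new_df_col = paste(old_df_col, row_number(), sep = "_")) %>%
  ungroup()
# new_df_col becomes object_1, object_2, object_3, object_4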

dplyr - mutate using function that uses other column data as argument?

I have a list with 3 regression models, called logregs. My data has a column called type that only has integers 1, 2, and 3, which are used to decide which regression model from logregs should be used, and a column called adstock which is the only independent variable used in the regression models.
I'm trying to do something like:
dataframe %>% mutate(probability = predict(logregs[[type]], type = "prediction", newdata = adstock) )
Sample data frame:
structure(list(type = c(3L, 3L, 3L, 3L, 3L, 3L), adstock = c(1.7984,
1.7984, 2.7984, 6.7984, 6.5968, 4.992)), row.names = c(NA, 6L
), class = "data.frame")
(unfortunately, the logregs models are too large to dput here)
How is this achievable using dplyr?
Yes, but you need to take a bit more care when subsetting logregs, and wrap your newdata= in a data.frame.
I'll generate a quick set of models based on mtcars.
library(dplyr)
library(tidyr)   # nest()
library(purrr)   # map()
library(tibble)  # deframe()

models <- mtcars %>%
  group_by(cyl = as.character(cyl)) %>%
  nest() %>%
  mutate(mdl = map(data, ~ lm(mpg ~ disp, data = .x))) %>%
  arrange(cyl) %>%
  select(cyl, mdl) %>%
  deframe()
models
models
# $`4`
# Call:
# lm(formula = mpg ~ disp, data = .x)
# Coefficients:
# (Intercept) disp
# 40.8720 -0.1351
# $`6`
# Call:
# lm(formula = mpg ~ disp, data = .x)
# Coefficients:
# (Intercept) disp
# 19.081987 0.003605
# $`8`
# Call:
# lm(formula = mpg ~ disp, data = .x)
# Coefficients:
# (Intercept) disp
# 22.03280 -0.01963
Note that they are indexed on the character of the number of cylinders, since otherwise numeric indexing can be confusing.
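As a small illustration of why the character names matter (a sketch using the models list built above):
models[["4"]]  # the model fitted to the 4-cylinder cars
# models[[4]] would instead ask for the fourth element of the list,
# which does not exist here: the list has three elements named "4", "6", "8"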
Let's modify mtcars$disp a little so we can use it again:
set.seed(42)
mtcars %>%
  mutate(disp = disp + sample(20, size = n(), replace = TRUE) - 10) %>%
  group_by(cyl) %>%
  sample_n(2)
# # A tibble: 6 x 11
# # Groups: cyl [3]
# mpg cyl disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
# 2 21.5 4 129. 97 3.7 2.46 20.0 1 0 3 1
# 3 21 6 169 110 3.9 2.62 16.5 0 1 4 4
# 4 19.2 6 173. 123 3.92 3.44 18.3 1 0 4 4
# 5 18.7 8 363 175 3.15 3.44 17.0 0 0 3 2
# 6 16.4 8 281. 180 3.07 4.07 17.4 0 0 3 3
The [[ indexing on your logregs expects a single type, but you're actually passing a vector. Since my data here is still grouped, I can take the first value of the group variable (cyl) and make just a single call to predict per group:
set.seed(42)
mtcars %>%
  mutate(disp = disp + sample(20, size = n(), replace = TRUE) - 10) %>%
  group_by(cyl) %>%
  sample_n(2) %>%
  mutate(mpg2 = predict(models[[as.character(cyl)[1]]], newdata = data.frame(disp)))
# # A tibble: 6 x 12
# # Groups: cyl [3]
# mpg cyl disp hp drat wt qsec vs am gear carb mpg2
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2 30.6
# 2 21.5 4 129. 97 3.7 2.46 20.0 1 0 3 1 23.4
# 3 21 6 169 110 3.9 2.62 16.5 0 1 4 4 19.7
# 4 19.2 6 173. 123 3.92 3.44 18.3 1 0 4 4 19.7
# 5 18.7 8 363 175 3.15 3.44 17.0 0 0 3 2 14.9
# 6 16.4 8 281. 180 3.07 4.07 17.4 0 0 3 3 16.5
If you don't want to (or cannot) group, then you need to run one prediction per row. This is more expensive, since it calls predict once per row with a single-row newdata=, but it still works. To do this, we'll map over the rows:
library(purrr) # map* functions

set.seed(42)
mtcars %>%
  mutate(disp = disp + sample(20, size = n(), replace = TRUE) - 10) %>%
  group_by(cyl) %>%
  sample_n(2) %>%
  ungroup() %>%
  mutate(mpg2 = map2_dbl(cyl, disp, ~ predict(models[[as.character(.x)]], newdata = data.frame(disp = .y))))
# # A tibble: 6 x 12
# mpg cyl disp hp drat wt qsec vs am gear carb mpg2
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2 30.6
# 2 21.5 4 129. 97 3.7 2.46 20.0 1 0 3 1 23.4
# 3 21 6 169 110 3.9 2.62 16.5 0 1 4 4 19.7
# 4 19.2 6 173. 123 3.92 3.44 18.3 1 0 4 4 19.7
# 5 18.7 8 363 175 3.15 3.44 17.0 0 0 3 2 14.9
# 6 16.4 8 281. 180 3.07 4.07 17.4 0 0 3 3 16.5
Note that I had to name the column in newdata = data.frame(disp = .y): when we did it before, data.frame(disp) automatically named the column after the input variable. In this case, .y is not a name the model knows, so we have to name the column explicitly.
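Translated back to the data in the question, a row-wise sketch would look like this; it assumes logregs is a list of three glm models indexed 1 to 3 by type, and uses type = "response" to get probabilities ("prediction" is not a predict() type for glm):
library(dplyr)
library(purrr)

dataframe %>%
  mutate(probability = map2_dbl(
    type, adstock,
    ~ predict(logregs[[.x]], newdata = data.frame(adstock = .y), type = "response")
  ))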

With dplyr and group_by, is there a way to reference the original (full) dataset? [duplicate]

This question already has answers here:
Assign intermediate output to temp variable as part of dplyr pipeline
(6 answers)
Closed 3 years ago.
QUESTION: Is there a way to reference the original dataset, or (preferably) the dataset in the chain right before the group_by()?
nrow(mtcars)
32 (but we all knew that)
> mtcars %>% group_by(cyl) %>% summarise(count = n())
# A tibble: 3 x 2
cyl count
<dbl> <int>
1 4 11
2 6 7
3 8 14
Great.
mtcars %>%
  group_by(cyl) %>%
  summarise(count = n(),
            prop = n() / SOMETHING)
I understand I could put nrow(mtcars) in there, but this is just an MRE. That's not an option in a more complex chain of operations.
Edit: I oversimplified the MRE. I am aware of the "." but I actually wanted to be able to pass the interim tibble off to another function (within the call to summarise), so the assign solution below does exactly what I was after. Thanks.
We can use add_count() to count the rows in each group and add the result as a new column while keeping the original data frame. If we need a more complex operation, we can follow up with mutate() from there.
library(dplyr)
library(tidyr)

mtcars %>%
  group_by(cyl) %>%
  add_count()
# # A tibble: 32 x 12
# # Groups: cyl [3]
# mpg cyl disp hp drat wt qsec vs am gear carb n
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
# 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 7
# 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 7
# 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 11
# 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 7
# 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 14
# 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 7
# 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 14
# 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 11
# 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 11
# 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 7
# # ... with 22 more rows
You are after the ".":
mtcars %>%
  group_by(cyl) %>%
  summarise(count = n(),
            prop = n() / nrow(.)) %>%
  ungroup()
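For the OP's edit about handing the interim tibble to another function, one common form of the assign trick is to stash the pre-group_by data under a name mid-pipe; a sketch (interim is a made-up name, and ->> assigns into the global environment):
library(dplyr)

mtcars %>%
  { . ->> interim } %>%  # save the data as it stands at this point in the chain
  group_by(cyl) %>%
  summarise(count = n(),
            prop = n() / nrow(interim))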

Select top rows in R using add_tally and top_n functions

I would like to select the top n rows in a data frame for which I calculated a column n that represents the sum of a variable. For example, using the mtcars data, I would like to filter to keep only the two cyl with the greatest sum of mpg. In the following example, I was expecting to select all rows where cyl == 4 or cyl == 8. It must be simple, but I cannot figure out my mistake.
library(tidyverse)

mtcars %>%
  group_by(cyl) %>%
  summarise(sum(mpg))
#> # A tibble: 3 x 2
#> cyl `sum(mpg)`
#> <dbl> <dbl>
#> 1 4 293.
#> 2 6 138.
#> 3 8 211.
mtcars %>%
  group_by(cyl) %>% # Calculate the sum of mpg for each cyl
  add_tally(mpg, sort = TRUE) %>%
  ungroup() %>%
  top_n(2, n)
#> # A tibble: 11 x 12
#> mpg cyl disp hp drat wt qsec vs am gear carb n
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 293.
#> 2 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 293.
#> 3 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 293.
#> 4 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1 293.
#> 5 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2 293.
#> 6 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1 293.
#> 7 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1 293.
#> 8 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1 293.
#> 9 26 4 120. 91 4.43 2.14 16.7 0 1 5 2 293.
#> 10 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2 293.
#> 11 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2 293.
Created on 2019-07-26 by the reprex package (v0.3.0)
It seems that top_n() returns the top n rows after ordering by the weighting variable, and it returns more than n rows if there are ties. It does not return the rows with the top n distinct values.
From the documentation:
Usage
top_n(x, n, wt)
Arguments
x: a tbl() to filter
n: number of rows to return. If x is grouped, this is the number of rows per group. Will include more than n rows if there are ties. If n is positive, selects the top n rows. If negative, selects the bottom n rows.
You need, as suggested by @tmfmnk:
mtcars %>%
  group_by(cyl) %>%
  add_tally(mpg, sort = TRUE) %>%
  ungroup() %>%
  filter(dense_rank(desc(n)) < 3)
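An equivalent route on dplyr 1.0 or later, where slice_max() supersedes top_n() (a sketch; top_cyl is just a helper name):
library(dplyr)

# total mpg per cyl, keeping the two largest totals
top_cyl <- mtcars %>%
  count(cyl, wt = mpg, name = "total_mpg") %>%
  slice_max(total_mpg, n = 2)

# keep only the rows whose cyl is in that top-two set
mtcars %>%
  semi_join(top_cyl, by = "cyl")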
