R tidy adding a summary by group - r

I am trying to add a column with a summary statistic calculated by group using group_by and mutate(). I've done it many times before w/o any issues, but the case is no longer. Now the new column contains the summary of the entire data set and not by group:
> mtcars %>% group_by(cyl) %>% mutate(mean = mean(disp))
# A tibble: 32 x 12
# Groups: cyl [3]
mpg cyl disp hp drat wt qsec vs am gear carb mean
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 231.
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 231.
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 231.
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 231.
...
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 231.
The 231 in the last column is the overall mean of disp. I'd expect the last column to contain group means, e.g. 183 for cyl=6 and 105 for cyl=4, etc. What am I doing wrong here?

Related

Assign most common value of factor variable with summarize in R

R noob here, working in tidyverse / RStudio.
I have a categorical / factor variable that I'd like to retain in a group_by/summarize workflow. I'd like to summarize it using a summary function that returns the most common value of that factor within each group.
Is there a summary function I can use for this?
mean returns NA, median only works with numeric data, and summary gives me separate rows with counts of each factor level instead of the most common level.
Edit: example using subset of mtcars dataset:
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
21 6 160 110 3.9 2.62 16.5 0 1 4 4
21 6 160 110 3.9 2.88 17.0 0 1 4 4
22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
24.4 4 147. 62 3.69 3.19 20 1 0 4 2
22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
Here I have converted carb into a factor variable. In this subset of the data, you can see that among 6-cylinder cars there are 3 with carb=4 and 1 with carb=1; similarly among 4-cylinder cars there are 2 with carb=2 and 1 with carb=1.
So if I do:
data %>% group_by(cyl) %>% summarise(modalcarb = FUNC(carb))
where FUNC is the function I'm looking for, I should get:
cyl carb
<dbl> <fct>
4 2
6 4
8 2 # there are multiple potential ways of handling multi-modal situations, but that's secondary here
Hope that makes sense!
You could use the function fmode of collapse to calculate the mode. Here I created a reproducible example using mtcars dataset where the cyl column is your factor variable to group on like this:
library(dplyr)
library(collapse)
mtcars %>%
mutate(cyl = as.factor(cyl)) %>%
group_by(cyl) %>%
summarise(mode = fmode(am))
#> # A tibble: 3 × 2
#> cyl mode
#> <fct> <dbl>
#> 1 4 1
#> 2 6 0
#> 3 8 0
Created on 2022-11-24 with reprex v2.0.2
We could use which.max after count:
library(dplyr)
# fake dataset
x <- mtcars %>%
mutate(cyl = factor(cyl)) %>%
select(cyl)
x %>%
count(cyl) %>%
slice(which.max(n))
cyl n
<fct> <int>
1 8 14
You can use which.max to index and table to count.
library(tidyverse)
mtcars |>
group_by(cyl) |>
summarise(modalcarb = carb[which.max(table(carb))])
#> # A tibble: 3 x 2
#> cyl modalcarb
#> <dbl> <dbl>
#> 1 4 2
#> 2 6 4
#> 3 8 3

Create list-column of dataframes with cur_data() in R. Problems with accessing data

I do the following:
mtcars %>%
group_by(cyl) %>%
summarise(
model = list(lm(disp ~ mpg, data = cur_data())),
data = list(dat = cur_data())
) -> df
However, when I want to access the list-column data, it gives me this error:
> df$data
$dat
Error: Can't subset elements that don't exist.
x Locations 2, 3, 4, 5, 6, etc. don't exist.
ℹ There are only 1 element.
While the actual glimpse looks like this:
> glimpse(df)
Rows: 3
Columns: 3
$ cyl <dbl> 4, 6, 8
$ model <list> [<233.067448, -4.797961, -15.673940, 30.702798, 17.126060, 1.086485, -11.509437, 0.68342…
$ data <named list> [<tbl_df[11 x 11]>, <tbl_df[7 x 11]>, <tbl_df[14 x 11]>]
Not really sure what is going wrong here...
Change the order of operation.
library(dplyr)
mtcars %>%
group_by(cyl) %>%
summarise(
data = list(dat = cur_data()),
model = list(lm(disp ~ mpg, data = cur_data())),
) -> df
df$data
#$dat
# A tibble: 11 x 10
# mpg disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 22.8 108 93 3.85 2.32 18.6 1 1 4 1
# 2 24.4 147. 62 3.69 3.19 20 1 0 4 2
# 3 22.8 141. 95 3.92 3.15 22.9 1 0 4 2
# 4 32.4 78.7 66 4.08 2.2 19.5 1 1 4 1
# 5 30.4 75.7 52 4.93 1.62 18.5 1 1 4 2
#...
#...
#$dat
# A tibble: 7 x 10
# mpg disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 21 160 110 3.9 2.62 16.5 0 1 4 4
#2 21 160 110 3.9 2.88 17.0 0 1 4 4
#3 21.4 258 110 3.08 3.22 19.4 1 0 3 1
#4 18.1 225 105 2.76 3.46 20.2 1 0 3 1
#5 19.2 168. 123 3.92 3.44 18.3 1 0 4 4
#6 17.8 168. 123 3.92 3.44 18.9 1 0 4 4
#7 19.7 145 175 3.62 2.77 15.5 0 1 5 6
#$dat
# A tibble: 14 x 10
# mpg disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 18.7 360 175 3.15 3.44 17.0 0 0 3 2
# 2 14.3 360 245 3.21 3.57 15.8 0 0 3 4
# 3 16.4 276. 180 3.07 4.07 17.4 0 0 3 3
# 4 17.3 276. 180 3.07 3.73 17.6 0 0 3 3
# 5 15.2 276. 180 3.07 3.78 18 0 0 3 3
#...
I don't know the exact reason for this error but my guess is that after doing model = list(lm(disp ~ mpg, data = cur_data())) cur_data() now consists of current grouped dataframe as well as the model which is causing issues in storing the data.

How do I specify a range of columns in a case_when statemet to check a condition in R?

Given the tibble -
library(tidyverse)
df <- mtcars %>% as_tibble() %>% slice(1:5)
df
# A tibble: 32 x 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
I know you can use something like df %>% select(c(mpg:vs)) to select a range of columns without typing all the column names in the select statement. How do I do something similar in a case_when statement. I have a dataset with about 35 columns and want to flag rows where all columns are equal to 0.
We can use where condition in select
df %>%
select(where(~ all(. %in% c(0, 1))))
-output
# A tibble: 5 x 2
vs am
<dbl> <dbl>
1 0 1
2 0 1
3 1 1
4 1 0
5 0 0
If we want to create a new column 'flag' that checks if all the column values are 0 for a particular row
df %>%
mutate(new = !rowSums(cur_data() != 0))
I'm not sure if we can use case_when about this problem, but we can use the following solution:
library(dplyr)
df %>%
rowwise() %>%
mutate(flag = +(all(c_across(where(is.numeric)) == 0)))
# A tibble: 5 x 12
# Rowwise:
mpg cyl disp hp drat wt qsec vs am gear carb flag
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 0
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 0
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 0
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 0
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 0

unnest_longer gives dollar sign instead of normal tibble

can anybody explain the following difference:
library(tidyverse)
tribble(~id,
c(1:10))%>%
unnest_longer(id)%>%
mutate(data = map(.x = id, ~mtcars))%>%
unnest_longer(data)
gives:
# A tibble: 320 x 2
id data$mpg $cyl $disp $hp $drat $wt $qsec $vs $am $gear $carb
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 1 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 1 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 1 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 1 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
whereas
library(tidyverse)
tribble(~id,
c(1:10))%>%
unnest_longer(id)%>%
mutate(data = map(.x = id, ~mtcars))%>%
unnest(data)
gives the result I want.
# A tibble: 320 x 12
id mpg cyl disp hp drat wt qsec vs am gear carb
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 1 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 1 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 1 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 1 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
Why are there $-Signs in the 1st example of code?
Thanks in advance!
This is a feature introduced in {tibble} version 2.0.1 - docs.
To get the same results as tidyr::unnest(), you'll need to add tidyr::unpack().
tribble(
~id,
c(1:10)
) %>%
tidyr::unnest_longer(id) %>%
dplyr::mutate(data = purrr::map(
.x = id,
~ mtcars
)) %>%
tidyr::unnest_longer(data) %>%
tidyr::unpack(cols = data)

Filtering data frame by condition including data after that condition

Is there an easy way to filter my data frame so that any rows after and including a row that follows some condition are filtered out? The issue here is that I want it to be robust enough to handle a case where that condition is not met, in which the whole data frame will be returned. Check out my examples below if that sounds confusing:
library(dplyr)
## Works
mtcars %>%
as_tibble() %>%
filter(between(row_number(), 1, which(mpg == 17.8)))
#> # A tibble: 11 x 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> 11 17.8 6 168. 123 3.92 3.44 18.9 1 0 4 4
## Doesn't work
mtcars %>%
as_tibble() %>%
filter(between(row_number(), 1, which(mpg == 30.5)))
#> Error in filter_impl(.data, quo): Evaluation error: Expecting a single value: [extent=0]..
Created on 2018-08-12 by the reprex package (v0.2.0).
You could include an ifelse statement to check whether the value is present in the dataframe. Also, you need to select the first row where the condition is verified to account for cases where the value is present more than once (in your example 21.0)
library(dplyr)
mtcars %>%
as_tibble() %>%
filter(between(row_number(), 1,ifelse(!any(mpg == 30),n(),which(mpg == 30)[1]-1)))
## returns the whole tibble
mtcars %>%
as_tibble() %>%
filter(between(row_number(), 1,ifelse(!any(mpg == 21),n(),which(mpg == 21)[1]-1)))
## Returns a tibble with 0 rows
mtcars %>%
as_tibble() %>%
filter(between(row_number(), 1,ifelse(!any(mpg == 21.4),n(),which(mpg == 21.4)[1]-1)))
## returns:
# A tibble: 3 x 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
I think your specific example does not work because there is no mpg that equals 30.5, however, you get the same error with mpg equals 21.0 because there are two rows with that value. You will need to chose whether you want the first or the last instance of that condition
library(tidyverse)
#max row
mtcars %>%
as_tibble() %>%
filter(between(row_number(), 1, which(mtcars$mpg == 21.0)[length(which(mtcars$mpg == 21.0))]))
#> # A tibble: 2 x 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
or
#min row
mtcars %>%
as_tibble() %>%
filter(between(row_number(), 1, which(mtcars$mpg == 21.0)[1]))
#> # A tibble: 1 x 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
The example I chose just happened to be rows 1 and 2, but it illustrates the idea.
EDIT
The other answer by Lamia is much more elegant, and I probably thought about this too hard, but I felt like I needed to come up with something
library(dplyr)
filter_if_condition <- function(.data, condition, yes){
test_cond <- enquo(condition)
yes_filter <- enquo(yes)
if(.data %>% filter(!!test_cond) %>% nrow() > 0){
.data %>% filter(!!yes_filter)
}
else{.data}
}
mtcars %>%
as_tibble() %>%
filter_if_condition(366.0 %in% mpg, between(row_number(), 1, which(mpg == 366)[1]))
#> # A tibble: 32 x 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> # ... with 22 more rows
mtcars %>%
as_tibble() %>%
filter_if_condition(18.1 %in% mpg, between(row_number(), 1, which(mpg == 18.1)[1]))
#> # A tibble: 6 x 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1

Resources