Iterating over values and variable names in dplyr::summarise - r

I'm using the following script to make a table in R:
library(dplyr)
library(tidyr)
get_probability <- function(parameter_array, threshold) {
return(round(100 * sum(parameter_array >= threshold) /
length(parameter_array)))
}
thresholds = c(75, 100, 125)
mtcars %>% group_by(gear) %>%
dplyr::summarise(
low=get_probability(disp, thresholds[[1]]),
medium=get_probability(disp, thresholds[[2]]),
high=get_probability(disp, thresholds[[3]]),
)
The table that comes out is the following:
# A tibble: 3 x 4
gear low medium high
<dbl> <dbl> <dbl> <dbl>
1 3 100 100 93
2 4 92 67 50
3 5 100 80 60
My question is: how can I condense what I pass to summarise into a single line? That is, is there a way to iterate over the thresholds vector while also supplying custom variable names?

In recent versions of dplyr, summarise will auto-splice data.frames created within it into new columns. So, you just need a way to iterate over thresholds to create a data.frame.
One option is purrr::map_dfc.
library(dplyr, warn.conflicts = FALSE)
get_probability <- function(parameter_array, threshold) {
return(round(100 * sum(parameter_array >= threshold) /
length(parameter_array)))
}
thresholds = c(75, 100, 125)
thresholds <- setNames(thresholds, c('low', 'medium', 'high'))
mtcars %>%
group_by(gear) %>%
summarise(purrr::map_dfc(thresholds, ~ get_probability(disp, .x)))
#> # A tibble: 3 × 4
#> gear low medium high
#> <dbl> <dbl> <dbl> <dbl>
#> 1 3 100 100 93
#> 2 4 92 67 50
#> 3 5 100 80 60
If you prefer not to use an extra package though, you could just lapply and then convert the output to data.frame. (Replace \(x) with function(x) in older versions of R)
mtcars %>%
group_by(gear) %>%
summarise(as.data.frame(lapply(thresholds, \(x) get_probability(disp, x))))
#> # A tibble: 3 × 4
#> gear low medium high
#> <dbl> <dbl> <dbl> <dbl>
#> 1 3 100 100 93
#> 2 4 92 67 50
#> 3 5 100 80 60
Created on 2021-08-17 by the reprex package (v2.0.1)
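To see the auto-splicing on its own, here is a minimal sketch. It assumes dplyr 1.0 or later; the column names min_disp and max_disp are made up for illustration. An unnamed data frame returned inside summarise becomes one output column per data frame column:

```r
library(dplyr, warn.conflicts = FALSE)

# An unnamed tibble returned inside summarise() is spliced into columns.
# min_disp and max_disp are illustrative names, not from the question.
disp_range <- mtcars %>%
  group_by(gear) %>%
  summarise(tibble(min_disp = min(disp), max_disp = max(disp)))

disp_range
```

The purrr and lapply versions above rely on the same behaviour; they just build the one-row data frame by iterating over the named thresholds vector.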

Related

Select rows within an overlapping range based on another column in R tidyverse

I have a data frame that looks like this:
col1 defines the start of a range when the direction is "+", while col2 defines the start of a range when the direction is "-".
library(tidyverse)
df <- tibble(col1=c(1,10,100,40,1000), col2=c(15,20,50,80,2000),
direction=c("+","+","-","+","+"), score=c(50,100,300,10,300))
df
#> # A tibble: 5 × 4
#> col1 col2 direction score
#> <dbl> <dbl> <chr> <dbl>
#> 1 1 15 + 50
#> 2 10 20 + 100
#> 3 100 50 - 300
#> 4 40 80 + 10
#> 5 1000 2000 + 300
Created on 2022-07-28 by the reprex package (v2.0.1)
Taking the direction into account, I want to keep, from each set of rows with overlapping ranges, the row with the highest score.
I want my data to look like this.
#> col1 col2 direction score
#> <dbl> <dbl> <chr> <dbl>
#> 1 10 20 + 100
#> 3 100 50 - 300
#> 5 1000 2000 + 300
Any ideas and help are highly appreciated.
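We could use slice_max after grouping by rleid on 'direction'. Note that rleid groups consecutive rows that share a direction; it does not compare the ranges themselves.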
library(dplyr)
library(data.table)
df %>%
group_by(grp = rleid(direction)) %>%
slice_max(n = 1, order_by = score) %>%
ungroup %>%
select(-grp)
Output:
# A tibble: 3 × 4
col1 col2 direction score
<dbl> <dbl> <chr> <dbl>
1 10 20 + 100
2 100 50 - 300
3 1000 2000 + 300
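If you would rather not load data.table just for rleid, the same run id can be sketched with dplyr alone. This is only a sketch of the same grouping (consecutive rows sharing a direction); like the answer above, it does not test whether the ranges actually overlap:

```r
library(dplyr, warn.conflicts = FALSE)

df <- tibble(
  col1 = c(1, 10, 100, 40, 1000),
  col2 = c(15, 20, 50, 80, 2000),
  direction = c("+", "+", "-", "+", "+"),
  score = c(50, 100, 300, 10, 300)
)

best <- df %>%
  # the run id increases by one each time the direction changes
  mutate(grp = cumsum(direction != lag(direction, default = first(direction)))) %>%
  group_by(grp) %>%
  slice_max(order_by = score, n = 1) %>%
  ungroup() %>%
  select(-grp)

best
```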

Calculate total elapsed time

I have the following data:
library(dplyr, warn.conflicts = FALSE)
df <- tibble(
x = c(30, 60, 90, 30, 60, 90),
phase = c(rep(c("phase 1", "phase 2"), each = 3))
)
df
#> # A tibble: 6 x 2
#> x phase
#> <dbl> <chr>
#> 1 30 phase 1
#> 2 60 phase 1
#> 3 90 phase 1
#> 4 30 phase 2
#> 5 60 phase 2
#> 6 90 phase 2
Created on 2020-08-11 by the reprex package (v0.3.0)
Where x is the elapsed time (in seconds) within each phase. Since phase is something that happens continuously, I am interested in calculating the total elapsed time.
Desired output:
#> # A tibble: 6 x 3
#> x phase elapsed_time
#> <dbl> <chr> <dbl>
#> 1 30 phase 1 30
#> 2 60 phase 1 60
#> 3 90 phase 1 90
#> 4 30 phase 2 120
#> 5 60 phase 2 150
#> 6 90 phase 2 180
Any ideas? Please note that my real example has many more phases.
I believe the following post has the answer you're looking for:
How to add a cumulative column to an R dataframe using dplyr?
It shows how to create a cumulative column using group_by and mutate. It appears you want the elapsed time to sum across both phases, so simply do not include the group_by call in your code.
Here is an idea via dplyr. First we need to group by the phase and get the time differences for each. We then ungroup() and take the cumsum() as a total, i.e.
library(dplyr)
df %>%
group_by(phase) %>%
mutate(diffs = x - lag(x, default = 0)) %>%
ungroup() %>%
mutate(res = cumsum(diffs)) %>%
select(-diffs)
# A tibble: 6 x 3
# x phase res
# <dbl> <chr> <dbl>
#1 30 phase 1 30
#2 60 phase 1 60
#3 90 phase 1 90
#4 30 phase 2 120
#5 60 phase 2 150
#6 90 phase 2 180
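Since the real data has more than two phases, here is a quick check of the same pipeline on a made-up three-phase data set (df3 is hypothetical, not the asker's data). It assumes the rows are already in chronological order:

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical data: the original two phases plus a third one
df3 <- tibble(
  x = rep(c(30, 60, 90), times = 3),
  phase = rep(c("phase 1", "phase 2", "phase 3"), each = 3)
)

res3 <- df3 %>%
  group_by(phase) %>%
  mutate(diffs = x - lag(x, default = 0)) %>%  # time step within each phase
  ungroup() %>%
  mutate(elapsed_time = cumsum(diffs)) %>%     # running total across phases
  select(-diffs)

res3$elapsed_time
```

Each row contributes a 30 second step here, so the running total climbs in steps of 30 up to 270.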
Here is another dplyr solution. It finds the start offset of each phase and adds it to x.
library(tidyverse)
df <- tibble(
x = c(30, 60, 90, 30, 60, 90),
phase = c(rep(c("phase 1", "phase 2"), each = 3))
)
df %>% group_by(phase) %>%
nest() %>%
mutate(start = map_dbl(data, max)) %>%
ungroup() %>%
# offset of each phase = total duration of all earlier phases
mutate(start = lag(cumsum(start), default = 0)) %>%
unnest(data) %>%
mutate(elapsed_time = start + x) %>%
select(-start)
#> # A tibble: 6 x 3
#> phase x elapsed_time
#> <chr> <dbl> <dbl>
#> 1 phase 1 30 30
#> 2 phase 1 60 60
#> 3 phase 1 90 90
#> 4 phase 2 30 120
#> 5 phase 2 60 150
#> 6 phase 2 90 180
Created on 2020-08-11 by the reprex package (v0.3.0)

Using Filter function in R. Need to assign NA and keep length of dataset the same for Horse Racing Database

I'm still new to the group and R.
I had some really helpful feedback on my last query so hoping I can get
some more support with the following:
I am working on a horse racing database that at this stage has 4 variables:
race horse number, race id, distance of race and the rating (DaH) assigned for the horse's
performance in the race.
The dataset:
horse_ratings <- tibble(
horse=c(1,1,1,2,2,2,3,3,3),
raceid=c(1,2,3,1,2,3,1,2,3),
Dist=c(9.47,9.47,10,10.1,10.2,9,11,9.47,10.5),
DaH=c(101,99,103,101,94,87,102,96,62)
)
Giving:
> horse_ratings
# A tibble: 9 x 4
horse raceid Dist DaH
<dbl> <dbl> <dbl> <dbl>
1 1 1 9.47 101
2 1 2 9.47 99
3 1 3 10 103
4 2 1 10.1 101
5 2 2 10.2 94
6 2 3 9 87
7 3 1 11 102
8 3 2 9.47 96
9 3 3 10.5 62
I will perform a number of calculations on the dataset, such as mean rating, max rating etc.,
which I'd like to result in a number of vectors of equal length.
I'm using the filter function to look at the performance ratings achieved for different
race distances (i.e. distance greater than 10 to begin with). However, if one of the horses has not
run a race at that distance, I've noticed that the result does not include that
horse in the output, i.e.:
> horse_ratings %>%
+ group_by(horse) %>%
+ filter(Dist>10) %>%
+ summarise(mean_rating=mean(DaH))
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 2 x 2
horse mean_rating
<dbl> <dbl>
1 2 97.5
2 3 82
So horse 1 has disappeared as it has not run a race of distance greater than 10.
I need to keep the output vector of length 3 ideally so I can put all the calculations
in to a dataframe of same length (for my final data output/print out).
I'm hoping there's a way of assigning an NA or similar to an output for horse 1
Giving:
# A tibble: 3 x 2
horse mean_rating
<dbl> <dbl>
1 1 NA
2 2 97.5
3 3 82
Or a similar solution.
Help would be much appreciated!!
You can use the .drop = FALSE parameter in group_by():
horse_ratings %>%
group_by(horse, .drop = FALSE) %>%
filter(Dist > 10) %>%
summarise(mean_rating = mean(DaH))
horse mean_rating
<dbl> <dbl>
1 1 NaN
2 2 97.5
3 3 82
Don't filter first, do it in summarise so you don't drop groups (horse).
library(dplyr)
horse_ratings %>%
group_by(horse) %>%
summarise(mean_rating = mean(DaH[Dist>10], na.rm = TRUE))
# A tibble: 3 x 2
# horse mean_rating
# <dbl> <dbl>
#1 1 NaN
#2 2 97.5
#3 3 82
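The mean of an empty vector is NaN rather than NA. If you want a literal NA for horse 1, as in the question, one possible follow-up step (a sketch, not part of the answer above) is to convert it afterwards:

```r
library(dplyr, warn.conflicts = FALSE)

horse_ratings <- tibble(
  horse = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  raceid = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
  Dist = c(9.47, 9.47, 10, 10.1, 10.2, 9, 11, 9.47, 10.5),
  DaH = c(101, 99, 103, 101, 94, 87, 102, 96, 62)
)

ratings_over_10 <- horse_ratings %>%
  group_by(horse) %>%
  # subsetting inside summarise keeps every horse as a group
  summarise(mean_rating = mean(DaH[Dist > 10])) %>%
  # mean(numeric(0)) is NaN; turn that into NA
  mutate(mean_rating = ifelse(is.nan(mean_rating), NA_real_, mean_rating))

ratings_over_10
```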
library(tidyverse)
Method 1:
horse_stats <-
horse_ratings %>%
mutate(raceid = as.factor(raceid)) %>%
filter(Dist > 10) %>%
group_by(horse) %>%
summarise_if(is.numeric, c("sum", "mean", "max", "min")) %>%
ungroup() %>%
left_join(horse_ratings %>%
select(horse) %>%
distinct(),
., by = "horse")
Method 2 :
horse_stats <-
horse_ratings %>%
mutate(raceid = factor(raceid),
Dist = ifelse(Dist <= 10, 0, Dist),
DaH = ifelse(Dist == 0, 0, DaH)) %>%
group_by(horse) %>%
summarise_if(is.numeric, c("sum", "mean", "max", "min")) %>%
ungroup() %>%
mutate_if(is.numeric, list(~na_if(., 0)))
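Another way to get the missing horses back after filtering is tidyr::complete. This is a sketch of a different technique from the two methods above; the column names mean_rating and max_rating are just examples:

```r
library(dplyr, warn.conflicts = FALSE)
library(tidyr)

horse_ratings <- tibble(
  horse = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  raceid = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
  Dist = c(9.47, 9.47, 10, 10.1, 10.2, 9, 11, 9.47, 10.5),
  DaH = c(101, 99, 103, 101, 94, 87, 102, 96, 62)
)

long_races <- horse_ratings %>%
  filter(Dist > 10) %>%
  group_by(horse) %>%
  summarise(mean_rating = mean(DaH), max_rating = max(DaH)) %>%
  # add a row of NAs for every horse that was filtered out entirely
  complete(horse = unique(horse_ratings$horse))

long_races
```

Because the summaries are only computed on the filtered rows, horses with no qualifying race get NA in every summary column.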

Can I use summarise_at for existing variables while adding other variables at the same time?

Suppose I have a grouped data frame:
> mtcars %>%
+ group_by(cyl) %>%
+ summarise(blah = mean(disp))
# A tibble: 3 x 2
cyl blah
<dbl> <dbl>
1 4 105.
2 6 183.
3 8 353.
Then suppose I want to sum some existing variables:
> mtcars %>%
+ group_by(cyl) %>%
+ summarise_at(vars(vs:carb), sum)
# A tibble: 3 x 5
cyl vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl>
1 4 10 8 45 17
2 6 4 3 27 24
3 8 0 2 46 49
However, if I want to add both summarise commands together, I cannot:
> mtcars %>%
+ group_by(cyl) %>%
+ summarise_at(vars(vs:carb), sum) %>%
+ summarise(blah = mean(disp))
Error in mean(disp) : object 'disp' not found
After using group_by() in a dplyr chain, how can I add new features with summarise() as well as summing existing features as above with summarise_at(vars(vs:carb), sum)?
The only way I can think of (at the moment) is to store the data immediately before your first summary, then run two summary verbs, and join them on the grouping variable. For instance:
library(dplyr)
grouped_data <- group_by(mtcars, cyl)
left_join(
summarize(grouped_data, blah = mean(disp)),
summarize_at(grouped_data, vars(vs:carb), sum),
by = "cyl")
# # A tibble: 3 x 6
# cyl blah vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 4 105. 10 8 45 17
# 2 6 183. 4 3 27 24
# 3 8 353. 0 2 46 49
You can left_join with the dataframe resulting from the summarise.
library(dplyr)
data(mtcars)
mtcars %>%
group_by(cyl) %>%
summarise_at(vars(vs:carb), sum) %>%
left_join(mtcars %>% group_by(cyl) %>% summarise(blah = mean(disp)))
#Joining, by = "cyl"
## A tibble: 3 x 6
# cyl vs am gear carb blah
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 4 10 8 45 17 105.
#2 6 4 3 27 24 183.
#3 8 0 2 46 49 353.
What I would do is use mutate_at for first step so that other columns are not collapsed and then use summarise_at with mean for all the columns together.
library(dplyr)
mtcars %>%
group_by(cyl) %>%
mutate_at(vars(vs:carb), sum) %>%
summarise_at(vars(vs:carb, disp), mean)
# cyl vs am gear carb disp
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 4 10 8 45 17 105.
#2 6 4 3 27 24 183.
#3 8 0 2 46 49 353.
Here's a way. We need to define a helper function first. It works only in a pipe chain, and it relies on unexported functions from dplyr, so it might break one day.
.at <- function(.vars, .funs, ...) {
# make sure we are in a piped call
in_a_piped_fun <- exists(".",parent.frame()) &&
length(ls(envir=parent.frame(), all.names = TRUE)) == 1
if (!in_a_piped_fun)
stop(".at() must be called as an argument to a piped function")
# borrow code from summarize_at
.tbl <- try(eval.parent(quote(.)))
dplyr:::manip_at(
.tbl, .vars, .funs, rlang::enquo(.funs), rlang::caller_env(),
.include_group_vars = TRUE, ...)
}
library(dplyr, warn.conflicts = FALSE)
mtcars %>%
summarize(!!!.at(vars(vs:carb), sum), blah = mean(disp))
#> vs am gear carb blah
#> 1 14 13 118 90 230.7219
Created on 2019-11-17 by the reprex package (v0.3.0)
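For what it's worth, in dplyr 1.0 and later the scoped verbs like summarise_at are superseded by across(), which can sit next to ordinary summary expressions in one summarise call. A minimal sketch, assuming a recent dplyr:

```r
library(dplyr, warn.conflicts = FALSE)

combined <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    across(vs:carb, sum),  # sum the existing columns vs, am, gear, carb
    blah = mean(disp)      # add a new summary in the same call
  )

combined
```

No join is needed, because both summaries are computed from the same grouped data in a single step.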

filter inside dplyr's summarise

I want to use filter or a similar function inside summarise from the dplyr package. I have a data frame (e.g. mtcars) that I need to group by a factor (e.g. cyl) and then calculate some statistics, including the percentage of total wt for every cyl type (wt.pc).
The question is: how can I subset/filter the wt column inside summarise to get the percentage computed without the last 10 rows?
I've tried this code, but it returns NA:
mtcars %>%
group_by(cyl) %>%
summarise(wt = round(sum(wt)),
wt.pc = sum(wt) * 100 / sum(mtcars[, 6]),
wt.pc.short = sum(wt[1:22]) * 100 / sum(mtcars[1:22, 6]),
drat.max = round(max(drat)))
# A tibble: 3 x 5
cyl wt wt.pc wt.pc.short drat.max
<dbl> <dbl> <dbl> <dbl> <dbl>
1 4 25 24.3 NA 5
2 6 22 21.4 NA 4
3 8 56 54.4 NA 4
wt.pc.short — % of sum(wt) for every cyl for shorter dataframe mtcars[1:22,]
Something like this?
mtcars %>%
mutate(id = row_number()) %>%
group_by(cyl) %>%
summarise(wt_new = round(sum(wt)), # note the change in name here!
wt.pc = sum(wt) * 100 / sum(mtcars[, 6]),
wt.pc.short = sum(wt[id<23]) * 100 / sum(mtcars[1:22, 6]),
drat.max = round(max(drat)))
# A tibble: 3 x 5
cyl wt_new wt.pc wt.pc.short drat.max
<dbl> <dbl> <dbl> <dbl> <dbl>
1 4 25 24.3 22.7 5
2 6 22 21.4 25.8 4
3 8 56 54.4 51.6 4
The important part here is that when you assign wt in the call to summarize, all subsequent references to wt will take the previously assigned wt, not the original wt. A statement such as wt[1:22] is thus somewhat problematic. You can see this here:
mean(mtcars[,"mpg"])
# [1] 20.09062
var(mtcars[,"mpg"])
# [1] 36.3241
mtcars %>% summarise(var_before = var(mpg),
mpg = mean(mpg),
var_after = var(mpg))
# var_before mpg var_after
# 1 36.3241 20.09062 NA
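The simplest way around this is to give the summary a name that does not clash with an input column. A small sketch (mean_mpg and var_again are made-up names):

```r
library(dplyr, warn.conflicts = FALSE)

no_clash <- mtcars %>%
  summarise(var_before = var(mpg),
            mean_mpg = mean(mpg),   # new name, so mpg itself is not overwritten
            var_after = var(mpg))   # still sees the original mpg column

no_clash
```

Here var_after matches var_before, because mpg still refers to the original column throughout the call.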
I think you can do it like this. First we calculate the row number within each group. If max(ID) > 10, the group has enough observations to remove the last 10 rows, so we keep only rows with ID < max(ID) - 9 (i.e. drop the last 10 rows). Otherwise ID == ID is TRUE for every row and nothing is removed.
mtcars %>% group_by(cyl) %>%
mutate(ID = row_number()) %>%
filter(if (max(ID) > 10) ID < (max(ID) - 9) else ID == ID)
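The same rule can probably be written more compactly with n() and row_number(), without the helper ID column. This is a sketch of an equivalent filter, not a different method:

```r
library(dplyr, warn.conflicts = FALSE)

trimmed <- mtcars %>%
  group_by(cyl) %>%
  # keep small groups whole; otherwise drop the last 10 rows of the group
  filter(n() <= 10 | row_number() <= n() - 10) %>%
  ungroup()

# rows kept per cyl group
kept <- count(trimmed, cyl)
kept
```

In mtcars the cyl groups have 11, 7 and 14 rows, so this keeps 1, 7 and 4 rows respectively.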

Resources