Calculate total elapsed time - r

I have the following data:
library(dplyr, warn.conflicts = FALSE)
df <- tibble(
x = c(30, 60, 90, 30, 60, 90),
phase = c(rep(c("phase 1", "phase 2"), each = 3))
)
df
#> # A tibble: 6 x 2
#> x phase
#> <dbl> <chr>
#> 1 30 phase 1
#> 2 60 phase 1
#> 3 90 phase 1
#> 4 30 phase 2
#> 5 60 phase 2
#> 6 90 phase 2
Created on 2020-08-11 by the reprex package (v0.3.0)
Here x is the elapsed time (in seconds) within each phase. Since the phases follow one another without a break, I am interested in calculating the total elapsed time across phases.
Desired output:
#> # A tibble: 6 x 3
#> x phase elapsed_time
#> <dbl> <chr> <dbl>
#> 1 30 phase 1 30
#> 2 60 phase 1 60
#> 3 90 phase 1 90
#> 4 30 phase 2 120
#> 5 60 phase 2 150
#> 6 90 phase 2 180
Any ideas? Please note that my real example has many more phases.

I believe the following post has the answer you're looking for:
How to add a cumulative column to an R dataframe using dplyr?
It shows how to create a cumulative column using group_by and mutate. Note, however, that x is already cumulative within each phase, so calling cumsum() on x directly (with or without group_by) does not give the desired column; you first need the per-phase increments.
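As a quick check on the sample data from the question, this is what an ungrouped cumsum() over x returns. Because x restarts in each phase but is already cumulative within it, the result differs from the desired elapsed_time column (the name naive is only for illustration):

```r
# x values copied from the question
x <- c(30, 60, 90, 30, 60, 90)

# Plain cumulative sum, no grouping
naive <- cumsum(x)
naive
#> [1]  30  90 180 210 270 360
```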

Here is an idea via dplyr. First we need to group by the phase and get the time differences for each. We then ungroup() and take the cumsum() as a total, i.e.
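library(dplyr)
df %>%
  group_by(phase) %>%
  # increment since the previous row of the same phase
  mutate(diffs = x - lag(x, default = 0)) %>%
  ungroup() %>%
  # running total of the increments across all phases
  mutate(res = cumsum(diffs)) %>%
  select(-diffs)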
# A tibble: 6 x 3
# x phase res
# <dbl> <chr> <dbl>
#1 30 phase 1 30
#2 60 phase 1 60
#3 90 phase 1 90
#4 30 phase 2 120
#5 60 phase 2 150
#6 90 phase 2 180
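Since the real data has many more phases, here is a minimal sketch of the same idea on hypothetical data with three phases of unequal length. The data frame df3 and its values are made up for illustration; only the technique comes from the answer above.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical data: three phases, unequal lengths
df3 <- tibble(
  x = c(30, 60, 90, 30, 60, 20, 40),
  phase = c(rep("phase 1", 3), rep("phase 2", 2), rep("phase 3", 2))
)

res <- df3 %>%
  group_by(phase) %>%
  # per-phase increments: first row of each phase keeps its own x
  mutate(diffs = x - lag(x, default = 0)) %>%
  ungroup() %>%
  # accumulate the increments over the whole table
  mutate(elapsed_time = cumsum(diffs)) %>%
  select(-diffs)

res$elapsed_time
#> [1]  30  60  90 120 150 170 190
```

This assumes the rows are already in chronological order, as they are in the question.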

Here is another dplyr solution. It finds the start of each phase, and adds this to x
library(tidyverse)
df <- tibble(
x = c(30, 60, 90, 30, 60, 90),
phase = c(rep(c("phase 1", "phase 2"), each = 3))
)
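df %>%
  group_by(phase) %>%
  nest() %>%
  # duration of each phase = largest x within it
  mutate(start = map_dbl(data, max)) %>%
  ungroup() %>%
  # start of a phase = total duration of all earlier phases
  # (cumsum is needed once there are more than two phases)
  mutate(start = lag(cumsum(start), default = 0)) %>%
  unnest(data) %>%
  mutate(elapsed_time = start + x) %>%
  select(-start)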
#> # A tibble: 6 x 3
#> phase x elapsed_time
#> <chr> <dbl> <dbl>
#> 1 phase 1 30 30
#> 2 phase 1 60 60
#> 3 phase 1 90 90
#> 4 phase 2 30 120
#> 5 phase 2 60 150
#> 6 phase 2 90 180
Created on 2020-08-11 by the reprex package (v0.3.0)
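The same "offset per phase" idea can also be sketched without nesting, by building a small lookup table of phase start times and joining it back. The names df3, offsets and phase_max are hypothetical, and the sketch assumes that the phase labels sort in chronological order (summarise() returns groups sorted by key, so a label like "phase 10" would sort before "phase 2").

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical data: three phases, unequal lengths
df3 <- tibble(
  x = c(30, 60, 90, 30, 60, 20, 40),
  phase = c(rep("phase 1", 3), rep("phase 2", 2), rep("phase 3", 2))
)

# One row per phase: its start time is the summed duration of earlier phases
offsets <- df3 %>%
  group_by(phase) %>%
  summarise(phase_max = max(x), .groups = "drop") %>%
  mutate(start = lag(cumsum(phase_max), default = 0)) %>%
  select(phase, start)

res <- df3 %>%
  left_join(offsets, by = "phase") %>%
  mutate(elapsed_time = start + x) %>%
  select(-start)

res$elapsed_time
#> [1]  30  60  90 120 150 170 190
```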

Related

Select rows within an overlapping range based on another column in R tidyverse

I have a data frame that looks like this:
col1 defines the start of a range when the direction is "+", while col2 defines the start of a range when the direction is "-".
library(tidyverse)
df <- tibble(col1=c(1,10,100,40,1000), col2=c(15,20,50,80,2000),
direction=c("+","+","-","+","+"), score=c(50,100,300,10,300))
df
#> # A tibble: 5 × 4
#> col1 col2 direction score
#> <dbl> <dbl> <chr> <dbl>
#> 1 1 15 + 50
#> 2 10 20 + 100
#> 3 100 50 - 300
#> 4 40 80 + 10
#> 5 1000 2000 + 300
Created on 2022-07-28 by the reprex package (v2.0.1)
By considering the direction, I want to extract from the rows with overlapping ranges the ones with the highest score.
I want my data to look like this.
#> col1 col2 direction score
#> <dbl> <dbl> <chr> <dbl>
#> 1 10 20 + 100
#> 3 100 50 - 300
#> 5 1000 2000 + 300
Any ideas and help are highly appreciated.
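We could use slice_max after grouping by rleid (from data.table) on 'direction'. Note that this groups consecutive rows that share a direction; it does not test the ranges themselves for overlap.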
library(dplyr)
library(data.table)
df %>%
group_by(grp = rleid(direction)) %>%
slice_max(n = 1, order_by = score) %>%
ungroup() %>%
select(-grp)
Output:
# A tibble: 3 × 4
col1 col2 direction score
<dbl> <dbl> <chr> <dbl>
1 10 20 + 100
2 100 50 - 300
3 1000 2000 + 300
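If the goal is to group rows whose ranges actually overlap, one possible sketch is below. It rests on an assumption that is not stated in the question: that two rows overlap whenever their intervals [min(col1, col2), max(col1, col2)] share at least one point, regardless of direction. The helper columns start, end and grp are hypothetical names.

```r
library(dplyr, warn.conflicts = FALSE)

df <- tibble(col1 = c(1, 10, 100, 40, 1000), col2 = c(15, 20, 50, 80, 2000),
             direction = c("+", "+", "-", "+", "+"), score = c(50, 100, 300, 10, 300))

res <- df %>%
  # orient every range so that start <= end
  mutate(start = pmin(col1, col2), end = pmax(col1, col2)) %>%
  arrange(start) %>%
  # a new cluster begins when a range starts after every earlier range has ended
  mutate(grp = cumsum(start > lag(cummax(end), default = -Inf))) %>%
  group_by(grp) %>%
  slice_max(score, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(-start, -end, -grp)

res
```

On the sample data this keeps the rows with col1 equal to 10, 100 and 1000, which matches the desired output; here the row 40-80 is dropped because it overlaps the higher-scoring 50-100 range.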

Divide group sum by total sum

I am using the dplyr package. Let's suppose I have the below table.
Group  count
A      20
A      10
B      30
B      35
C      50
C      60
My goal is to create a summary table that contains the mean for each group and also each group's mean as a proportion of the sum of all group means. So the final table will look like this:
Group  avg   prcnt_of_total
A      15    .14
B      32.5  .31
C      55    .53
For example, 0.14 is the result of the following calculation: 15/(15+32.5+55)
Right now, I was only able to produce the first column code that calculates the mean for each group:
summary_df<- df %>%
group_by(Group)%>%
summarise(avg=mean(count))
I still don't know how to produce the prcnt_of_total column. Any suggestions?
You can use the following code:
df <- read.table(text="Group count
A 20
A 10
B 30
B 35
C 50
C 60", header = TRUE)
library(dplyr)
df %>%
group_by(Group) %>%
summarise(avg = mean(count)) %>%
ungroup() %>%
mutate(prcnt_of_total = prop.table(avg))
#> # A tibble: 3 × 3
#> Group avg prcnt_of_total
#> <chr> <dbl> <dbl>
#> 1 A 15 0.146
#> 2 B 32.5 0.317
#> 3 C 55 0.537
Created on 2022-07-14 by the reprex package (v2.0.1)
We can drop the group in summarise itself.
library(dplyr)
df1 %>%
group_by(Group) %>%
summarise(avg = mean(count), .groups = "drop") %>%
mutate(prcnt_of_total = avg/sum(avg))
#> # A tibble: 3 x 3
#> Group avg prcnt_of_total
#> <chr> <dbl> <dbl>
#> 1 A 15 0.146
#> 2 B 32.5 0.317
#> 3 C 55 0.537
On another note, I am not sure that the average divided by the sum of averages is a meaningful metric unless every group has the same number of entries. With that in mind, here is another solution.
## each group's share of the grand total; this matches the version above
## only when every group has the same number of rows
df1 %>%
group_by(Group) %>%
summarise(avg = mean(count),
prcnt_of_total = sum(count)/sum(.$count))
#> # A tibble: 3 x 3
#> Group avg prcnt_of_total
#> <chr> <dbl> <dbl>
#> 1 A 15 0.146
#> 2 B 32.5 0.317
#> 3 C 55 0.537
Data:
read.table(text = "Group count
A 20
A 10
B 30
B 35
C 50
C 60",
header = T, stringsAsFactors = F) -> df1
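To illustrate the point about group sizes, here is a small sketch on hypothetical data in which group A has three rows and group B has one. The data frame df_uneven and the column names share_of_means and share_of_total are made up for the example.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical data with unequal group sizes
df_uneven <- data.frame(Group = c("A", "A", "A", "B"),
                        count = c(10, 20, 30, 60))

# Mean of each group divided by the sum of the group means
by_mean <- df_uneven %>%
  group_by(Group) %>%
  summarise(avg = mean(count), .groups = "drop") %>%
  mutate(share_of_means = avg / sum(avg))

# Sum of each group divided by the grand total
by_sum <- df_uneven %>%
  group_by(Group) %>%
  summarise(total = sum(count), .groups = "drop") %>%
  mutate(share_of_total = total / sum(total))

by_mean$share_of_means
#> [1] 0.25 0.75
by_sum$share_of_total
#> [1] 0.5 0.5
```

With equal group sizes, as in the question, the two columns coincide.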
You can do this:
df %>%
group_by(Group) %>%
summarize(avg = mean(count), prcnt_of_total = sum(count)/sum(df$count))
Output:
Group avg prcnt_of_total
<chr> <dbl> <dbl>
1 A 15 0.146
2 B 32.5 0.317
3 C 55 0.537
data.table is similar:
library(data.table)
setDT(df)[, .(avg = mean(count), prcnt_of_total = sum(count)/sum(df$count)), by = Group]

Count rows with conditions in summarise

I am trying to count records in summarize given conditions, as in the code below, but it is not working. I want to count total rows, not a distinct count.
What am I doing wrong?
library(dplyr)
data %>%
group_by(UNIT) %>%
summarize(TOTAL = n(),
SLA = n(DATETIME[TYPE=='A' || TYPE=='B']))
Appreciate any help
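Use sum() instead of n(). n() takes no arguments, and || is not vectorised, so use the element-wise | and sum the resulting logical vector: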
library(dplyr)
set.seed(123)
data <- data.frame(
UNIT = sample(1:3, size = 100, replace = TRUE),
TYPE = sample(c('A', 'B', 'C'), size = 100, replace = TRUE)
)
head(data)
#> UNIT TYPE
#> 1 3 C
#> 2 3 B
#> 3 3 B
#> 4 2 C
#> 5 3 A
#> 6 2 A
data %>%
group_by(UNIT) %>%
summarise(TOTAL = n(), SLA = sum(TYPE == 'A' | TYPE == 'B'))
#> # A tibble: 3 × 3
#> UNIT TOTAL SLA
#> <int> <int> <int>
#> 1 1 33 22
#> 2 2 32 19
#> 3 3 35 27
Created on 2022-02-18 by the reprex package (v2.0.1)
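When the condition is "TYPE is one of several values", %in% is a common alternative to chaining | comparisons. A minimal sketch on a small hypothetical data frame (the values are made up so the counts are easy to verify by eye):

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical data: unit 1 has three rows, unit 2 has two
data <- data.frame(
  UNIT = c(1, 1, 1, 2, 2),
  TYPE = c("A", "C", "B", "C", "C")
)

res <- data %>%
  group_by(UNIT) %>%
  summarise(TOTAL = n(),
            # TRUE counts as 1, so sum() gives the number of matching rows
            SLA = sum(TYPE %in% c("A", "B")))

res
```

Here unit 1 has TOTAL 3 and SLA 2, and unit 2 has TOTAL 2 and SLA 0.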

Iterating over values and variable names in dplyr::summarise

I'm using the following script to make a table in R:
library(dplyr)
library(tidyr)
get_probability <- function(parameter_array, threshold) {
return(round(100 * sum(parameter_array >= threshold) /
length(parameter_array)))
}
thresholds = c(75, 100, 125)
mtcars %>% group_by(gear) %>%
dplyr::summarise(
low=get_probability(disp, thresholds[[1]]),
medium=get_probability(disp, thresholds[[2]]),
high=get_probability(disp, thresholds[[3]]),
)
The table that comes out is the following:
# A tibble: 3 x 4
gear low medium high
<dbl> <dbl> <dbl> <dbl>
1 3 100 100 93
2 4 92 67 50
3 5 100 80 60
My question is: how can I condense what I have passed to summarise into a single line? That is, is there a way to iterate over the thresholds vector while also passing custom variable names?
In dplyr 1.0.0 and later, summarise() automatically unpacks an unnamed data frame created inside it into new columns. So you just need a way to iterate over thresholds to create a data frame.
One option is purrr::map_dfc().
library(dplyr, warn.conflicts = FALSE)
get_probability <- function(parameter_array, threshold) {
return(round(100 * sum(parameter_array >= threshold) /
length(parameter_array)))
}
thresholds = c(75, 100, 125)
thresholds <- setNames(thresholds, c('low', 'medium', 'high'))
mtcars %>%
group_by(gear) %>%
summarise(purrr::map_dfc(thresholds, ~ get_probability(disp, .x)))
#> # A tibble: 3 × 4
#> gear low medium high
#> <dbl> <dbl> <dbl> <dbl>
#> 1 3 100 100 93
#> 2 4 92 67 50
#> 3 5 100 80 60
If you prefer not to use an extra package, you could just use lapply() and then convert the output to a data.frame. (The \(x) shorthand requires R 4.1.0 or later; replace it with function(x) in older versions.)
mtcars %>%
group_by(gear) %>%
summarise(as.data.frame(lapply(thresholds, \(x) get_probability(disp, x))))
#> # A tibble: 3 × 4
#> gear low medium high
#> <dbl> <dbl> <dbl> <dbl>
#> 1 3 100 100 93
#> 2 4 92 67 50
#> 3 5 100 80 60
Created on 2021-08-17 by the reprex package (v2.0.1)
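As far as I know, purrr marks map_dfc() as superseded from version 1.0.0 onwards, so a variant that relies only on purrr::map() plus tibble::as_tibble() may age better. This is a sketch of that variant, not the original answer's code; get_probability() and thresholds are copied from above.

```r
library(dplyr, warn.conflicts = FALSE)

get_probability <- function(parameter_array, threshold) {
  round(100 * sum(parameter_array >= threshold) / length(parameter_array))
}

# Named vector: the names become the output column names
thresholds <- c(low = 75, medium = 100, high = 125)

res <- mtcars %>%
  group_by(gear) %>%
  summarise(
    # map() returns a named list of length-one values;
    # as_tibble() turns it into a one-row data frame that summarise() unpacks
    tibble::as_tibble(purrr::map(thresholds, function(t) get_probability(disp, t)))
  )

res
```

The result has the same columns and values as the tables shown above.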

Using Filter function in R. Need to assign NA and keep length of dataset the same for Horse Racing Database

I'm still new to the group and R.
I had some really helpful feedback on my last query so hoping I can get
some more support with the following:
I am working on a horse racing database that at this stage has 4 variables:
horse number, race id, distance of the race and the rating (DaH) assigned for the horse's
performance in the race.
The dataset:
horse_ratings <- tibble(
horse=c(1,1,1,2,2,2,3,3,3),
raceid=c(1,2,3,1,2,3,1,2,3),
Dist=c(9.47,9.47,10,10.1,10.2,9,11,9.47,10.5),
DaH=c(101,99,103,101,94,87,102,96,62)
)
Giving:
> horse_ratings
# A tibble: 9 x 4
horse raceid Dist DaH
<dbl> <dbl> <dbl> <dbl>
1 1 1 9.47 101
2 1 2 9.47 99
3 1 3 10 103
4 2 1 10.1 101
5 2 2 10.2 94
6 2 3 9 87
7 3 1 11 102
8 3 2 9.47 96
9 3 3 10.5 62
I will perform a number of calculations on the dataset, such as mean rating, max rating etc.,
which I'd like to result in a number of vectors of equal length.
I'm using the filter function to look at the performance ratings achieved for different
race distances (i.e. distance greater than 10 to begin with). However, if one of the horses has not
run a race over that distance, then I've noticed that the result does not include that
horse in the output, i.e.:
> horse_ratings %>%
+ group_by(horse) %>%
+ filter(Dist>10) %>%
+ summarise(mean_rating=mean(DaH))
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 2 x 2
horse mean_rating
<dbl> <dbl>
1 2 97.5
2 3 82
So horse 1 has disappeared as it has not run a race of distance greater than 10.
Ideally I need to keep the output vector at length 3 so I can put all the calculations
into a dataframe of the same length (for my final data output/print out).
I'm hoping there's a way of assigning an NA or similar to the output for horse 1,
Giving:
# A tibble: 3 x 2
horse mean_rating
<dbl> <dbl>
1 1 NA
2 2 97.5
3 3 82
Or a similar solution.
Help would be much appreciated!!
You can use the .drop = FALSE parameter in group_by():
horse_ratings %>%
group_by(horse, .drop = FALSE) %>%
filter(Dist > 10) %>%
summarise(mean_rating = mean(DaH))
horse mean_rating
<dbl> <dbl>
1 1 NaN
2 2 97.5
3 3 82
Don't filter first, do it in summarise so you don't drop groups (horse).
library(dplyr)
horse_ratings %>%
group_by(horse) %>%
summarise(mean_rating = mean(DaH[Dist>10], na.rm = TRUE))
# A tibble: 3 x 2
# horse mean_rating
# <dbl> <dbl>
#1 1 NaN
#2 2 97.5
#3 3 82
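Both answers above return NaN for horse 1. If a true NA is preferred, as in the desired output, one possible sketch is to filter and summarise as in the question and then restore the missing horses with tidyr::complete(). The data is copied from the question; using complete() here is my suggestion rather than part of the answers above.

```r
library(dplyr, warn.conflicts = FALSE)
library(tidyr)

horse_ratings <- tibble(
  horse  = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  raceid = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
  Dist   = c(9.47, 9.47, 10, 10.1, 10.2, 9, 11, 9.47, 10.5),
  DaH    = c(101, 99, 103, 101, 94, 87, 102, 96, 62)
)

res <- horse_ratings %>%
  filter(Dist > 10) %>%
  group_by(horse) %>%
  summarise(mean_rating = mean(DaH), .groups = "drop") %>%
  # add a row for every horse in the full data; new rows get NA
  complete(horse = unique(horse_ratings$horse))

res
```

This gives one row per horse, with mean_rating NA for horse 1, 97.5 for horse 2 and 82 for horse 3.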
library(tidyverse)
Method 1:
horse_stats <-
horse_ratings %>%
mutate(raceid = as.factor(raceid)) %>%
filter(Dist > 10) %>%
group_by(horse) %>%
summarise_if(is.numeric, c("sum", "mean", "max", "min")) %>%
ungroup() %>%
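# join back onto the full list of horses so that none are dropped
# (all.x is an argument of base merge(), not of left_join(), so it is omitted)
left_join(horse_ratings %>%
select(horse) %>%
distinct(),
., by = "horse")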
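Method 2 (note that the zeros introduced here still count towards mean and min, so those values differ from Method 1):
horse_stats <-
horse_ratings %>%
mutate(raceid = factor(raceid),
Dist = ifelse(Dist <= 10, 0, Dist),
DaH = ifelse(Dist == 0, 0, DaH)) %>%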
group_by(horse) %>%
summarise_if(is.numeric, c("sum", "mean", "max", "min")) %>%
ungroup() %>%
mutate_if(is.numeric, list(~na_if(., 0)))

Resources