library(lubridate)
library(tidyverse)
step_count_raw <- read_csv("data/step-count/step-count.csv",
                           locale = locale(tz = "Australia/Melbourne"))
location <- read_csv("data/step-count/location.csv")

step_count <- step_count_raw %>%
  rename_with(~ c("date_time", "date", "count")) %>%
  left_join(location) %>%                               # joins on the shared column(s) of the two files
  mutate(location = replace_na(location, "Melbourne"))  # unmatched rows default to Melbourne
step_count
#> # A tibble: 5,448 x 4
#> date_time date count location
#> <dttm> <date> <dbl> <chr>
#> 1 2019-01-01 09:00:00 2019-01-01 764 Melbourne
#> 2 2019-01-01 10:00:00 2019-01-01 913 Melbourne
#> 3 2019-01-02 00:00:00 2019-01-02 9 Melbourne
#> 4 2019-01-02 10:00:00 2019-01-02 2910 Melbourne
#> 5 2019-01-02 11:00:00 2019-01-02 1390 Melbourne
#> 6 2019-01-02 12:00:00 2019-01-02 1020 Melbourne
#> 7 2019-01-02 13:00:00 2019-01-02 472 Melbourne
#> 8 2019-01-02 15:00:00 2019-01-02 1220 Melbourne
#> 9 2019-01-02 16:00:00 2019-01-02 1670 Melbourne
#> 10 2019-01-02 17:00:00 2019-01-02 1390 Melbourne
#> # … with 5,438 more rows
From step_count, I want to calculate the average daily step count for every location, ending up with a tibble called city_avg_steps.
Expected output:
#> # A tibble: 4 x 2
#> location avg_count
#> <chr> <dbl>
#> 1 Austin 7738.
#> 2 Denver 12738.
#> 3 Melbourne 7912.
#> 4 San Francisco 13990.
My code and output:
city_avg_steps <- step_count %>%
  group_by(location) %>%
  summarise(avg_count = mean(count))
city_avg_steps
# A tibble: 4 x 2
location avg_count
<chr> <dbl>
1 Austin 721.
2 Denver 650.
3 Melbourne 530.
4 San Francisco 654.
My hunch is to calculate the daily totals first and then average them, using two summarise() calls, but I'm not sure how to chain them.
As #dash2 explains in the comments, your desired output requires a two-stage aggregation: first aggregate the steps into daily totals (adding them together with sum), then aggregate the daily totals into location-level averages with mean.
step_count %>%
  group_by(date, location) %>%
  summarise(sum_steps = sum(count, na.rm = TRUE)) %>%
  ungroup() %>%
  group_by(location) %>%
  summarise(avg_count = mean(sum_steps, na.rm = TRUE))
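Since you mentioned chaining two summarise() calls: summarise() peels off one level of grouping, so grouping by location first and date second lets the second summarise() aggregate per location with no explicit ungroup()/regroup. A minimal sketch of that variant (daily_steps is just an intermediate name):
city_avg_steps <- step_count %>%
  group_by(location, date) %>%
  summarise(daily_steps = sum(count, na.rm = TRUE)) %>%  # one row per location-day, still grouped by location
  summarise(avg_count = mean(daily_steps, na.rm = TRUE)) # second summarise() averages within location
city_avg_steps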
I have a data.frame with some prices per day. I would like to get the average daily price in another column (avg_price). How can I do that?
date price avg_price
1 2017-01-01 01:00:00 10 18.75
2 2017-01-01 01:00:00 10 18.75
3 2017-01-01 05:00:00 25 18.75
4 2017-01-01 04:00:00 30 18.75
5 2017-01-02 08:00:00 10 20
6 2017-01-02 08:00:00 30 20
7 2017-01-02 07:00:00 20 20
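For a reproducible example, here's a minimal construction of the input (assumed from the printed table; only date and price are inputs, avg_price is the desired new column):
library(tibble)
df <- tibble(
  date = as.POSIXct(c("2017-01-01 01:00:00", "2017-01-01 01:00:00",
                      "2017-01-01 05:00:00", "2017-01-01 04:00:00",
                      "2017-01-02 08:00:00", "2017-01-02 08:00:00",
                      "2017-01-02 07:00:00")),
  price = c(10, 10, 25, 30, 10, 30, 20)
)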
library(lubridate)
library(tidyverse)
df %>%
group_by(day = day(date)) %>%
summarise(avg_price = mean(price))
# A tibble: 2 x 2
day avg_price
<int> <dbl>
1 1 18.8
2 2 20
df %>%
group_by(day = day(date)) %>%
mutate(avg_price = mean(price))
# A tibble: 7 x 4
# Groups: day [2]
date price avg_price day
<dttm> <dbl> <dbl> <int>
1 2017-01-01 01:00:00 10 18.8 1
2 2017-01-01 01:00:00 10 18.8 1
3 2017-01-01 05:00:00 25 18.8 1
4 2017-01-01 04:00:00 30 18.8 1
5 2017-01-02 08:00:00 10 20 2
6 2017-01-02 08:00:00 30 20 2
7 2017-01-02 07:00:00 20 20 2
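One caveat: day() extracts the day of the month, so this grouping would lump together, say, 2017-01-01 and 2017-02-01 if the data spanned more than one month. Grouping on the calendar date is more robust; a sketch of the same mutate() with that change:
df %>%
  group_by(day = as.Date(date)) %>%
  mutate(avg_price = mean(price)) %>%
  ungroup()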
I am trying to convert the data below to a daily basis, based on the range given in the start_date and end_date columns,
to this output (inventory summed per day):
Please use dput() when posting data frames next time!
Example data
# A tibble: 4 × 4
id start end inventory
<int> <chr> <chr> <dbl>
1 1 01/05/2022 02/05/2022 100
2 2 10/05/2022 15/05/2022 50
3 3 11/05/2022 21/05/2022 80
4 4 14/05/2022 17/05/2022 10
Transform the data (assigning the result back to df so the next step can reuse it):
df <- df %>%
  mutate(across(2:3, ~ as.Date(.x, format = "%d/%m/%Y"))) %>%  # parse dd/mm/yyyy strings
  pivot_longer(cols = c(start, end), values_to = "date") %>%
  arrange(date) %>%
  select(date, inventory)
df
# A tibble: 8 × 2
date inventory
<date> <dbl>
1 2022-05-01 100
2 2022-05-02 100
3 2022-05-10 50
4 2022-05-11 80
5 2022-05-14 10
6 2022-05-15 50
7 2022-05-17 10
8 2022-05-21 80
Expand the dates and left_join
left_join(tibble(date = seq(first(df$date),
last(df$date),
by = "day")), df)
# A tibble: 21 × 2
date inventory
<date> <dbl>
1 2022-05-01 100
2 2022-05-02 100
3 2022-05-03 NA
4 2022-05-04 NA
5 2022-05-05 NA
6 2022-05-06 NA
7 2022-05-07 NA
8 2022-05-08 NA
9 2022-05-09 NA
10 2022-05-10 50
# … with 11 more rows
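Going back to the original df (with start, end, inventory), a sketch of a more direct route to daily sums, assuming inventory should count on every day between start and end inclusive: expand each row into one row per day, then add overlapping rows together.
library(tidyverse)

df %>%
  mutate(across(c(start, end), ~ as.Date(.x, format = "%d/%m/%Y"))) %>%
  mutate(date = map2(start, end, seq, by = "day")) %>%  # list-column of daily dates per row
  unnest(date) %>%                                      # one row per id-day
  group_by(date) %>%
  summarise(inventory = sum(inventory), .groups = "drop")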
library(tidyverse)
library(lubridate)
library(padr)
df <- tibble(`Action Item ID` = c("ABC", "DEF", "GHI", "JKL", "MNO", "PQR"),
`Date Created` = as.Date(c("2019-01-01", "2019-01-01",
"2019-06-01", "2019-06-01",
"2019-08-01", "2019-08-01")),
`Date Closed` = as.Date(c("2019-01-15", "2019-05-31",
"2019-06-15", "2019-07-05",
"2019-08-15", NA)),
`Current Status` = c(rep("Closed", 5), "Open")) %>%
pivot_longer(-c(`Action Item ID`, `Current Status`),
names_to = "Type",
values_to = "Date")
#> # A tibble: 12 x 4
#> `Action Item ID` `Current Status` Type Date
#> <chr> <chr> <chr> <date>
#> 1 ABC Closed Date Created 2019-01-01
#> 2 ABC Closed Date Closed 2019-01-15
#> 3 DEF Closed Date Created 2019-01-01
#> 4 DEF Closed Date Closed 2019-05-31
#> 5 GHI Closed Date Created 2019-06-01
#> 6 GHI Closed Date Closed 2019-06-15
#> 7 JKL Closed Date Created 2019-06-01
#> 8 JKL Closed Date Closed 2019-07-05
#> 9 MNO Closed Date Created 2019-08-01
#> 10 MNO Closed Date Closed 2019-08-15
#> 11 PQR Open Date Created 2019-08-01
#> 12 PQR Open Date Closed NA
I've got my data frame above and I'm trying to pad dates within each group with the padr R package.
df %>% group_by(`Action Item ID`) %>% pad()
#> Error: Not all grouping variables are column names of x.
The error doesn't make much sense to me. I'm looking for output that would look like the following:
#> # A tibble: ? x 4
#> `Action Item ID` `Current Status` Type Date
#> <chr> <chr> <chr> <date>
#> ABC Closed Date Created 2019-01-01
#> ABC NA NA 2019-01-02
#> ABC NA NA 2019-01-03
#> ... ... ... ...
#> ABC Closed Date Closed 2019-01-15
#> DEF Closed Date Created 2019-01-01
#> DEF NA NA 2019-01-02
#> ... ... ... ...
#> DEF NA NA 2019-05-30
#> DEF Closed Date Closed 2019-05-31
#> GHI Closed Date Created 2019-06-01
#> ... ... ... ...
Anybody have any idea what went wrong?
According to ?pad, there is a group argument
group - Optional character vector that specifies the grouping variable(s). Padding will take place within the different groups. When interval is not specified, it will be determined applying get_interval on the datetime variable as a whole, ignoring groups (see last example).
So it is better to make use of that parameter instead of group_by():
library(dplyr)
library(padr)
df %>%
pad(group = "Action Item ID")
# A tibble: 233 x 4
# `Action Item ID` `Current Status` Type Date
# <chr> <chr> <chr> <date>
# 1 ABC Closed Date Created 2019-01-01
# 2 ABC <NA> <NA> 2019-01-02
# 3 ABC <NA> <NA> 2019-01-03
# 4 ABC <NA> <NA> 2019-01-04
# 5 ABC <NA> <NA> 2019-01-05
# 6 ABC <NA> <NA> 2019-01-06
# 7 ABC <NA> <NA> 2019-01-07
# 8 ABC <NA> <NA> 2019-01-08
# 9 ABC <NA> <NA> 2019-01-09
#10 ABC <NA> <NA> 2019-01-10
# … with 223 more rows
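If you'd rather avoid padr, a hedged alternative with tidyr::complete (complete() operates within groups when given a grouped data frame):
library(dplyr)
library(tidyr)

df %>%
  group_by(`Action Item ID`) %>%
  complete(Date = seq(min(Date, na.rm = TRUE),
                      max(Date, na.rm = TRUE), by = "day")) %>%
  ungroup()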
Although I have achieved what I want (see result below), I find my approach kind of convoluted. I would like to remove, by group, all observations before the nearest observation at or before a cut-off date (here cut_off). I cannot simply calculate the nearest observation using min(abs(x - date)) because there can be ties of observations that are equally distant in the positive and negative direction (e.g., group "b" in mydf).
I solved it by looking for the index of the last observation where date - cut_off <= 0, and then using this index in dplyr::slice(). I am very curious about other approaches, though.
Very open to non-dplyr solutions.
Every group contains at least one row where observation date - cut-off date <= 0.
library(tidyverse)
set.seed(8)
mydf <- data.frame(
  group = rep(letters[1:3], each = 5),
  date1 = as.Date(sample(15), origin = '1970-01-01'),
  cut_off = as.Date(rep(sample(10, 3), each = 5), origin = '1970-01-01')
)
mydf %>%
  arrange(group, date1) %>%
  group_by(group) %>%
  mutate(diff = date1 - cut_off,
         min_abs = min(abs(date1 - cut_off)))
#> # A tibble: 15 x 5
#> # Groups: group [3]
#> group date1 cut_off diff min_abs
#> <fct> <date> <date> <drtn> <drtn>
#> 1 a 1970-01-03 1970-01-05 -2 days 0 days
#> 2 a 1970-01-05 1970-01-05 0 days 0 days
#> 3 a 1970-01-08 1970-01-05 3 days 0 days
#> 4 a 1970-01-13 1970-01-05 8 days 0 days
#> 5 a 1970-01-15 1970-01-05 10 days 0 days
#> 6 b 1970-01-02 1970-01-09 -7 days 2 days
#> 7 b 1970-01-06 1970-01-09 -3 days 2 days
#> 8 b 1970-01-07 1970-01-09 -2 days 2 days
#> 9 b 1970-01-11 1970-01-09 2 days 2 days
#> 10 b 1970-01-12 1970-01-09 3 days 2 days
#> 11 c 1970-01-04 1970-01-11 -7 days 1 days
#> 12 c 1970-01-09 1970-01-11 -2 days 1 days
#> 13 c 1970-01-10 1970-01-11 -1 days 1 days
#> 14 c 1970-01-14 1970-01-11 3 days 1 days
#> 15 c 1970-01-16 1970-01-11 5 days 1 days
# min(abs(x)) does not help when the distance from neg and pos values is tied, see group b
Desired result (solution already pretty ok)
mydf %>%
arrange(group, date1) %>%
group_by(group) %>%
mutate(diff = date1 - cut_off) %>%
slice(max(which(diff <= 0)):n())
# finds the index of the last element in "diff" which fulfills the condition
#> # A tibble: 10 x 4
#> # Groups: group [3]
#> group date1 cut_off diff
#> <fct> <date> <date> <drtn>
#> 1 a 1970-01-05 1970-01-05 0 days
#> 2 a 1970-01-08 1970-01-05 3 days
#> 3 a 1970-01-13 1970-01-05 8 days
#> 4 a 1970-01-15 1970-01-05 10 days
#> 5 b 1970-01-07 1970-01-09 -2 days
#> 6 b 1970-01-11 1970-01-09 2 days
#> 7 b 1970-01-12 1970-01-09 3 days
#> 8 c 1970-01-10 1970-01-11 -1 days
#> 9 c 1970-01-14 1970-01-11 3 days
#> 10 c 1970-01-16 1970-01-11 5 days
Created on 2019-12-16 by the reprex package (v0.3.0)
Here are a couple of approaches with dplyr:
We can use top_n to select the top n dates from each group, where n is computed per group as the number of values greater than cut_off, plus one (for the nearest value at or before it).
library(dplyr)
mydf %>%
  group_by(group) %>%
  top_n(sum(date1 > cut_off) + 1, date1)
# group date1 cut_off
# <fct> <date> <date>
# 1 a 1970-01-05 1970-01-05
# 2 a 1970-01-08 1970-01-05
# 3 a 1970-01-13 1970-01-05
# 4 a 1970-01-15 1970-01-05
# 5 b 1970-01-11 1970-01-09
# 6 b 1970-01-12 1970-01-09
# 7 b 1970-01-07 1970-01-09
# 8 c 1970-01-14 1970-01-11
# 9 c 1970-01-16 1970-01-11
#10 c 1970-01-10 1970-01-11
This selects the rows correctly, but note that top_n does not sort the data, so you might want to add arrange(group, date1) at the end of the chain, as shown in the sketch below.
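For completeness, the same chain with that arrange appended (nothing else assumed):
library(dplyr)
mydf %>%
  group_by(group) %>%
  top_n(sum(date1 > cut_off) + 1, date1) %>%
  arrange(group, date1) %>%
  ungroup()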
Another approach, similar to the one posted in the OP, uses slice:
mydf %>%
arrange(group, date1) %>%
group_by(group) %>%
slice((which.max(date1 > cut_off) - 1):n())
# group date1 cut_off
# <fct> <date> <date>
# 1 a 1970-01-05 1970-01-05
# 2 a 1970-01-08 1970-01-05
# 3 a 1970-01-13 1970-01-05
# 4 a 1970-01-15 1970-01-05
# 5 b 1970-01-07 1970-01-09
# 6 b 1970-01-11 1970-01-09
# 7 b 1970-01-12 1970-01-09
# 8 c 1970-01-10 1970-01-11
# 9 c 1970-01-14 1970-01-11
#10 c 1970-01-16 1970-01-11
We can also tweak this to work inside filter:
mydf %>%
arrange(group, date1) %>%
group_by(group) %>%
filter(row_number() >= which.max(date1 > cut_off) - 1)
which can be translated to base R as:
new_df <- mydf[with(mydf, order(group, date1)), ]
subset(new_df, ave(date1 > cut_off, group, FUN = function(x)
seq_along(x) >= (which.max(x) - 1)))
You could detect the rows where the dates are above your threshold, then use lead on them to keep one more value:
library(dplyr)
mydf %>%
  arrange(group, date1) %>%
  group_by(group) %>%
  # keep rows whose *next* row is past the cut-off; on sorted data this retains
  # everything from the last at-or-before-cut-off row onward (default = TRUE keeps the final row)
  filter(lead(date1 > cut_off, default = TRUE)) %>%
  ungroup()
#> # A tibble: 10 x 3
#> group date1 cut_off
#> <fct> <date> <date>
#> 1 a 1970-01-05 1970-01-05
#> 2 a 1970-01-08 1970-01-05
#> 3 a 1970-01-13 1970-01-05
#> 4 a 1970-01-15 1970-01-05
#> 5 b 1970-01-07 1970-01-09
#> 6 b 1970-01-11 1970-01-09
#> 7 b 1970-01-12 1970-01-09
#> 8 c 1970-01-10 1970-01-11
#> 9 c 1970-01-14 1970-01-11
#> 10 c 1970-01-16 1970-01-11
I would like to remove all observations up to the nearest observation
to a given date, by group. I cannot simply calculate the nearest
observation using min(abs(x - date)) because there can be ties of
observations that are equally distant in the positive and negative
(e.g., group "b" in mydf).
Your criterion can produce ties, so you need to deliberately choose a way to break them. You could pick whichever entry comes first (or last; check ?top_n):
mydf %>%
mutate(diff = abs(date1- cut_off)) %>%
arrange(group, diff) %>%
group_by(group) %>%
top_n(n = 1, wt = -diff)
# A tibble: 3 x 4
# Groups: group [3]
group date1 cut_off diff
<fct> <date> <date> <drtn>
1 a 1970-01-12 1970-01-11 1 days
2 b 1970-01-07 1970-01-02 5 days
3 c 1970-01-03 1970-01-04 1 days
* For some reason I got different values even while using your seed (8).
If you need to filter only the cases where diff <= 0, just add that to the pipe chain.
The sorting and the diff are a wise move. For the last part, since your dates are already sorted within each group, you can use which.min on the absolute difference to pick out the nearest row and keep it and everything after it. On ties, which.min returns the first index, so when a row before and a row after the cut-off are equally near (group b), the earlier one wins:
mydf %>%
  arrange(group, date1) %>%
  group_by(group) %>%
  mutate(delta = abs(date1 - cut_off)) %>%
  filter(1:n() >= which.min(delta))  # which.min breaks ties in favour of the earlier row
# A tibble: 10 x 4
# Groups: group [3]
group date1 cut_off delta
<fct> <date> <date> <drtn>
1 a 1970-01-05 1970-01-05 0 days
2 a 1970-01-08 1970-01-05 3 days
3 a 1970-01-13 1970-01-05 8 days
4 a 1970-01-15 1970-01-05 10 days
5 b 1970-01-07 1970-01-09 2 days
6 b 1970-01-11 1970-01-09 2 days
7 b 1970-01-12 1970-01-09 3 days
8 c 1970-01-10 1970-01-11 1 days
9 c 1970-01-14 1970-01-11 3 days
10 c 1970-01-16 1970-01-11 5 days
I've got a fairly straightforward problem, but I'm struggling to find a solution that doesn't require a wall of code and complicated loops.
I've got a summary table, df, for an hourly timeseries dataset where each observation belongs to a group.
I want to merge some of those groups, based on a boolean column in the summary table.
The boolean column, merge_with_next indicates whether a given group should be merged with the next group (one row down).
The merging effectively occurs by updating the end value and removing rows:
library(dplyr)
# Demo data
df <- tibble(
group = 1:12,
start = seq.POSIXt(as.POSIXct("2019-01-01 00:00"), as.POSIXct("2019-01-12 00:00"), by = "1 day"),
end = seq.POSIXt(as.POSIXct("2019-01-01 23:59"), as.POSIXct("2019-01-12 23:59"), by = "1 day"),
merge_with_next = rep(c(TRUE, TRUE, FALSE), 4)
)
df
#> # A tibble: 12 x 4
#> group start end merge_with_next
#> <int> <dttm> <dttm> <lgl>
#> 1 1 2019-01-01 00:00:00 2019-01-01 23:59:00 TRUE
#> 2 2 2019-01-02 00:00:00 2019-01-02 23:59:00 TRUE
#> 3 3 2019-01-03 00:00:00 2019-01-03 23:59:00 FALSE
#> 4 4 2019-01-04 00:00:00 2019-01-04 23:59:00 TRUE
#> 5 5 2019-01-05 00:00:00 2019-01-05 23:59:00 TRUE
#> 6 6 2019-01-06 00:00:00 2019-01-06 23:59:00 FALSE
#> 7 7 2019-01-07 00:00:00 2019-01-07 23:59:00 TRUE
#> 8 8 2019-01-08 00:00:00 2019-01-08 23:59:00 TRUE
#> 9 9 2019-01-09 00:00:00 2019-01-09 23:59:00 FALSE
#> 10 10 2019-01-10 00:00:00 2019-01-10 23:59:00 TRUE
#> 11 11 2019-01-11 00:00:00 2019-01-11 23:59:00 TRUE
#> 12 12 2019-01-12 00:00:00 2019-01-12 23:59:00 FALSE
# Desired result
desired <- tibble(
group = c(1, 4, 7, 9),
start = c("2019-01-01 00:00", "2019-01-04 00:00", "2019-01-07 00:00", "2019-01-10 00:00"),
end = c("2019-01-03 23:59", "2019-01-06 23:59", "2019-01-09 23:59", "2019-01-12 23:59")
)
desired
#> # A tibble: 4 x 3
#> group start end
#> <dbl> <chr> <chr>
#> 1 1 2019-01-01 00:00 2019-01-03 23:59
#> 2 4 2019-01-04 00:00 2019-01-06 23:59
#> 3 7 2019-01-07 00:00 2019-01-09 23:59
#> 4 9 2019-01-10 00:00 2019-01-12 23:59
Created on 2019-03-22 by the reprex package (v0.2.1)
I'm looking for a short and clear solution that doesn't involve a myriad of helper tables and loops. The final value in the group column is not significant, I only care about the start and end columns from the result.
We can use dplyr and create a new group starting at every row that follows a FALSE in merge_with_next, then select the first value of group and start and the last value of end for each group.
library(dplyr)
df %>%
group_by(temp = cumsum(!lag(merge_with_next, default = TRUE))) %>%
summarise(group = first(group),
start = first(start),
end = last(end)) %>%
ungroup() %>%
select(-temp)
# group start end
# <int> <dttm> <dttm>
#1 1 2019-01-01 00:00:00 2019-01-03 23:59:00
#2 4 2019-01-04 00:00:00 2019-01-06 23:59:00
#3 7 2019-01-07 00:00:00 2019-01-09 23:59:00
#4 10 2019-01-10 00:00:00 2019-01-12 23:59:00
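To see what the temp index is doing, here's the same lag/cumsum logic applied to a standalone vector (same pattern as the demo data): a new group starts on every row whose previous row was FALSE.
m <- rep(c(TRUE, TRUE, FALSE), 4)
cumsum(!dplyr::lag(m, default = TRUE))
#> [1] 0 0 0 1 1 1 2 2 2 3 3 3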