Count of rows per group between dates in R

I am trying to do a count of rows that fall on and between two dates (minimum and maximum) per group. The only caveat is each group has a different pair of dates. See example below.
This is my raw dataset.
raw <- data.frame(Group = c("A", "B", "A", "A", "B"),
                  Date = c("2017-01-01", "2017-02-02", "2017-09-01", "2017-12-31", "2017-05-09"))
I would like it to return this...
clean <- data.frame(Group = c("A", "B"),
                    Min = c("2017-01-01", "2017-02-02"),
                    Max = c("2017-12-31", "2017-05-09"),
                    Count = c(3, 2))
How would I be able to do this? The min and max variables are not crucial, but I would definitely like to know how to get the count variable. Thank you!

Is the date range given, or do you want to calculate it from the data as well? If the latter, this should do it:
library(tidyverse)

raw %>%
  mutate(Date = as.Date(Date)) %>%
  group_by(Group) %>%
  summarise(min_date = min(Date), max_date = max(Date), count = n())
Output:
# A tibble: 2 x 4
  Group min_date   max_date   count
  <fct> <date>     <date>     <int>
1 A     2017-01-01 2017-12-31     3
2 B     2017-02-02 2017-05-09     2
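If instead the date range for each group is supplied externally rather than computed, one sketch (assuming a hypothetical lookup table named ranges that holds each group's bounds) joins the bounds on and counts the rows that fall on or between them:

# Hypothetical per-group bounds; not part of the original question
ranges <- data.frame(Group = c("A", "B"),
                     Min = as.Date(c("2017-01-01", "2017-02-02")),
                     Max = as.Date(c("2017-12-31", "2017-05-09")))

raw %>%
  mutate(Date = as.Date(Date)) %>%
  inner_join(ranges, by = "Group") %>%
  filter(Date >= Min, Date <= Max) %>% # keep rows on or between the bounds
  group_by(Group, Min, Max) %>%
  summarise(Count = n(), .groups = "drop")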

Count unique values per month in R

I have a dataset with dead bird records from field observers.
  Death.Date Observer Species Bird.ID
1 03/08/2021       DA      MF FC10682
2 15/08/2021       AG      MF FC10698
3 12/01/2022       DA      MF FC20957
4 09/02/2022       DA      MF FC10708
I want to produce a dataset from this with the number of unique Bird.ID / Month so I can produce a graph from that. ("unique" because some people make mistakes and enter a bird twice sometimes).
The output in this case would be:
Month Number of dead
08/2021 2
01/2022 1
02/2022 1
The idea is to use the distinct function but by month (knowing the value is in date format dd/mm/yyyy).
In case your Death.Date column is character type, first transform it to Date with dmy(). Then change the format to month and year, group_by, and summarise:
library(dplyr)
library(lubridate) # in case your Date is in character format
df %>%
  mutate(Death.Date = dmy(Death.Date)) %>% # you may not need this line
  mutate(Month = format(Death.Date, "%m/%Y")) %>%
  group_by(Month) %>%
  summarise(`Number of dead` = n())
Month `Number of dead`
<chr> <int>
1 01/2022 1
2 02/2022 1
3 08/2021 2
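Note that n() counts rows, so duplicate Bird.ID entries (which the question says do occur) would be double-counted. A safer variant of the same pipeline uses n_distinct():

df %>%
  mutate(Month = format(dmy(Death.Date), "%m/%Y")) %>%
  group_by(Month) %>%
  summarise(`Number of dead` = n_distinct(Bird.ID)) # counts each bird once per month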
For completeness, this can be achieved using aggregate without any additional packages:
df <- data.frame(
  Death.Date = c("3/8/2021", "15/08/2021", "12/1/2022", "9/2/2022"),
  Observer = c("DA", "AG", "DA", "DA"),
  Species = c("MF", "MF", "MF", "MF"),
  Bird.ID = c("FC10682", "FC10698", "FC20957", "FC10708")
)
aggregate.data.frame(
  x = df["Bird.ID"],
  by = list(death_month = format(as.Date(df$Death.Date, "%d/%m/%Y"), "%m/%Y")),
  FUN = function(x) {length(unique(x))}
)
Notes
The anonymous function function(x) {length(unique(x))} provides the count of the unique values
The format(as.Date(df$Death.Date, "%d/%m/%Y"), "%m/%Y") call ensures that the month/year string is provided
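For reference, the call should return something like:

  death_month Bird.ID
1     01/2022       1
2     02/2022       1
3     08/2021       2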
data.table solution
library(data.table)
library(lubridate)
# Reproducible example with a duplicated bird
deadbirds <- data.table::data.table(
  Death.Date = c("03/08/2021", "15/08/2021", "12/01/2022", "09/02/2022", "03/08/2021"),
  Observer = c("DA", "AG", "DA", "DA", "DA"),
  Species = c("MF", "MF", "MF", "MF", "MF"),
  Bird.ID = c("FC10682", "FC10698", "FC20957", "FC10708", "FC10682"))
# Clean dataset, option 1: delete all fully duplicated rows
deadbirds <- base::unique(deadbirds)
# Clean dataset, option 2 (alternative to option 1): keep only the first row per bird
# (useful when duplicates carry different values in irrelevant columns)
deadbirds <- deadbirds[
  j = .SD[1],
  by = c("Bird.ID")
]
# Death.Date as date
deadbirds <- deadbirds[
  j = Death.Date := lubridate::dmy(Death.Date)
]
# Create month.Death.Date
deadbirds <- deadbirds[
  j = month.Death.Date := base::paste0(lubridate::month(Death.Date),
                                       "/",
                                       lubridate::year(Death.Date))
]
# Count by month
deadbirds <- deadbirds[
  j = .(`Number of dead` = .N),
  by = month.Death.Date
]
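A more compact sketch, starting again from the raw deadbirds table: data.table's uniqueN() counts distinct Bird.ID values directly, so the separate de-duplication step can be skipped.

deadbirds[
  j = .(`Number of dead` = data.table::uniqueN(Bird.ID)),
  by = .(month.Death.Date = format(lubridate::dmy(Death.Date), "%m/%Y"))
]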
A possible solution, based on tidyverse, lubridate and zoo::as.yearmon:
library(tidyverse)
library(lubridate)
library(zoo)
df <- data.frame(
  Death.Date = c("3/8/2021", "15/08/2021", "12/1/2022", "9/2/2022"),
  Observer = c("DA", "AG", "DA", "DA"),
  Species = c("MF", "MF", "MF", "MF"),
  Bird.ID = c("FC10682", "FC10698", "FC20957", "FC10708")
)
df %>%
  group_by(date = as.yearmon(dmy(Death.Date))) %>%
  summarise(nDead = n_distinct(Bird.ID), .groups = "drop")
#> # A tibble: 3 x 2
#>   date      nDead
#>   <yearmon> <int>
#> 1 Aug 2021      2
#> 2 Jan 2022      1
#> 3 Feb 2022      1
You could use:
as.data.frame(table(format(as.Date(df$Death.Date,'%d/%m/%Y'), '%m/%Y')))
#      Var1 Freq
# 1 01/2022    1
# 2 02/2022    1
# 3 08/2021    2
data:
df <- data.frame(
  Death.Date = c("3/8/2021", "15/08/2021", "12/1/2022", "9/2/2022"),
  Observer = c("DA", "AG", "DA", "DA"),
  Species = c("MF", "MF", "MF", "MF"),
  Bird.ID = c("FC10682", "FC10698", "FC20957", "FC10708")
)
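Note that table() counts rows rather than unique Bird.ID values, so de-duplicate first if that matters. If the default Var1/Freq names bother you, a short follow-up (a sketch) renames the columns and orders the months chronologically:

out <- as.data.frame(table(format(as.Date(df$Death.Date, '%d/%m/%Y'), '%m/%Y')))
names(out) <- c("Month", "Number of dead")
out[order(as.Date(paste0("01/", out$Month), format = "%d/%m/%Y")), ] # chronological order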

Deal with following dates in a table in R

My problem is easy to explain: I have a table with start dates and end dates, n rows, ordered by start date. The yellow rows in the image below are the ones I want combined onto a single row with the first start date and the last end date.
[Image: table with rows whose dates follow on]
I would like to regroup dates onto one row when start date n+1 == end date n. Here is an example of the result I need:
[Image: the result I need]
I tried to use for loops that compare the two vectors of dates (extracted from the columns), but it does not really work. I tried something like this to identify the start and end dates:
a = sort(data$Date_debut)
b = sort(data$Date_fin)
for (i in 1:(length(a) - 1)) {
  for (j in 2:length(a)) {
    datedeb = a[j - 1]
    if (b[i] + 1 == a[j]) {
      while (b[i] + 1 == a[j]) {
        datefin = b[i + 1]
        i = i + 1
      }
    }
  }
}
(Here datedeb = start date and datefin = end date.)
Thank you for your help, I am open to ideas / ways to deal with this.
Here is one approach using tidyverse. For each Var1 group, build a subgroup index that increments whenever a row's start date does not follow on from the previous row's end date (here "follow on" means start date equals the previous end date plus one day, matching the sample data); consecutive rows share an index. Then group_by both Var1 and the index, and take the first start date and last end date as the date range.
library(tidyverse)

df %>%
  group_by(Var1) %>%
  mutate(i = cumsum(Start_date != lag(End_date, default = as.Date(-Inf, origin = "1970-01-01")) + 1)) %>%
  group_by(i, .add = TRUE) %>%
  summarise(Start_date = first(Start_date), End_date = last(End_date)) %>%
  select(-i)
Output
  Var1  Start_date End_date
  <chr> <date>     <date>
1 A     2019-01-02 2019-04-09
2 A     2019-10-11 2019-10-11
3 B     2019-12-03 2019-12-20
4 C     2019-12-29 2019-12-31
Data
df <- structure(list(Var1 = c("A", "A", "A", "A", "B", "C"), Start_date = structure(c(17898,
17962, 17993, 18180, 18233, 18259), class = "Date"), End_date = structure(c(17961,
17992, 17995, 18180, 18250, 18261), class = "Date")), class = "data.frame", row.names = c(NA,
-6L))
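The same run-detection idea works in base R; here is a sketch (assuming the df above):

df <- df[order(df$Var1, df$Start_date), ]

# Previous End_date within each Var1 group (NA for each group's first row)
prev_end <- ave(as.numeric(df$End_date), df$Var1,
                FUN = function(x) c(NA, head(x, -1)))

# A new run starts at a group's first row, or when the start date
# is not the previous end date + 1 day
df$run <- cumsum(is.na(prev_end) | as.numeric(df$Start_date) != prev_end + 1)

do.call(rbind, lapply(split(df, df$run), function(g)
  data.frame(Var1 = g$Var1[1],
             Start_date = min(g$Start_date),
             End_date = max(g$End_date))))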

Finding the mean time (independent of date) that an event occurs in R

I have a list of date-times corresponding to events that occurred over multiple days, and I am hoping to find the mean time that different categories of events occurred, independent of date (i.e., the mean time for events falling into category A was 11:04:42). I have a data frame that looks similar to the following:
library(lubridate) # for ymd_hms()

df <- data.frame(category = c("A", "A", "B", "A", "C", "C", "B", "D", "A", "D", "D", "C"),
                 times = ymd_hms(c("2021-09-12 21:34:22", "2021-09-13 15:42:37",
                                   "2021-09-16 22:36:50", "2021-09-24 09:41:00",
                                   "2021-09-20 12:14:30", "2021-09-15 16:40:39",
                                   "2021-09-15 09:16:39", "2021-09-14 15:50:47",
                                   "2021-09-24 18:10:00", "2021-09-21 17:30:00",
                                   "2021-09-14 17:43:53", "2021-09-23 19:00:00")))
I would like to find the mean time for all events in category A, but when I call something like mean(times), the output is a date and a time, whereas I would just like a time, independent of what day each event occurred.
As an example, I have tried summarizing the data frame, like so:
summary_times <- df %>%
  group_by(category) %>%
  summarize(avg_time = mean(times))
The result is "2021-09-18 06:20:06 UTC", which is not what I would like—I'm interested in generalizing to any given day, so I'm hoping for a time that does not take the dates of the individual events into account.
I have also tried taking the individual means of the hours, minutes, and seconds, and then taking the means of those individually, but I have not been successful with that, either. My first attempt looked like this:
summary_times <- df %>%
  group_by(category) %>%
  summarize(avg_time = paste(mean(hour(times)), ":",
                             mean(minute(times)), ":",
                             mean(second(times))))
This gave me a "time" (just as a character object, which is fine with me; this is just being displayed in a table), but each of the hour, minute, and seconds had decimal remainders. This problem led me to try this next iteration:
summary_times <- df %>%
  group_by(category) %>%
  summarize(avg_time = paste(sum(hour(times)) %/% n(), ":",
                             sum(minute(times)) %/% n() + (sum(hour(times)) %% n()) * 60, ":",
                             sum(second(times)) %/% n() + (sum(minute(times)) %% n()) * 60))
I no longer got decimal remainders on each component of the time; however, some of the components were greater than they could possibly be (e.g., a time of "15:247:130").
Any assistance in how to find this mean time in the day of events—either by pointing in the direction of a function that can perform this task, or by investigating the taking-the-mean-of-the-individual-components option—would be greatly appreciated!
An option is to convert the times to ITime and then take the mean:

library(data.table)
library(dplyr)

df %>%
  group_by(category) %>%
  summarise(avg_time = mean(as.ITime(times)))
-output
# A tibble: 4 × 2
  category avg_time
  <chr>    <ITime>
1 A        16:16:59
2 B        15:56:44
3 C        15:58:23
4 D        17:01:33
Another option is to change the date part to a single standardized date, take the mean, and then format the result to keep only the time part:

df %>%
  group_by(category) %>%
  summarise(times = format(mean(as.POSIXct(format(times, '2021-09-01 %H:%M:%S'))), '%H:%M:%S'))
# A tibble: 4 × 2
  category times
  <chr>    <chr>
1 A        16:16:59
2 B        15:56:44
3 C        15:58:23
4 D        17:01:33
Or do this in base R
transform(
  aggregate(times ~ category,
            data = transform(df, times = as.POSIXct(format(times, '2021-09-01 %H:%M:%S'))),
            FUN = mean),
  times = format(times, '%H:%M:%S')
)
-output
  category    times
1        A 16:16:59
2        B 15:56:44
3        C 15:58:23
4        D 17:01:33
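One caveat worth knowing, not raised in the answers above: an arithmetic mean of clock times misbehaves when events straddle midnight (23:50 and 00:10 should arguably average to 00:00, not 12:00). If that can happen in your data, a circular mean treats times of day as angles; here is a base R sketch (the helper name circular_mean_time is made up for illustration):

circular_mean_time <- function(x) {
  secs  <- as.numeric(difftime(x, trunc(x, "days"), units = "secs")) # seconds since midnight
  theta <- 2 * pi * secs / 86400                                     # time of day as an angle
  m     <- atan2(mean(sin(theta)), mean(cos(theta)))                 # mean angle
  s     <- as.integer(round((m %% (2 * pi)) * 86400 / (2 * pi)))     # back to seconds
  sprintf("%02d:%02d:%02d", s %/% 3600, (s %% 3600) %/% 60, s %% 60)
}

circular_mean_time(ymd_hms(c("2021-09-12 23:50:00", "2021-09-13 00:10:00")))
#> "00:00:00"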

R: Pivot numeric data from columns to rows based on string in variable name

I have a data set that I want to pivot to long format depending on if the variable name contains any of the strings: list_a <- c("a", "b", "c") and list_b <- c("usd", "eur", "gbp"). The data set only contains values in one row. I want the values in list_b to become column names and the values in list_a to become row names in the resulting dataset. Please see the reproducable example data set below.
I currently solve this by applying the following R code once for each value in list_b, producing three data frames ("df_usd", "df_eur" and "df_gbp") which I then merge on the column "name". This is cumbersome, and I would very much appreciate a more elegant solution, since the variables in list_b change from month to month (list_a stays the same) and updating the existing code manually is both time consuming and invites manual error.
# Current solution for df_usd:
df_usd <- df %>%
  select(date, contains("usd")) %>%
  pivot_longer(cols = contains(c("a_", "b_", "c_")),
               names_to = "name", values_to = "usd") %>%
  mutate(name = case_when(
    str_detect(name, "a_") ~ "a",
    str_detect(name, "b_") ~ "b",
    str_detect(name, "c_") ~ "c")) %>%
  select(-date)
[Image: the starting point in Excel]
[Image: the result I want to achieve in Excel]
# Example data to copy and paste into R for easy reproduction of the problem:
df <- data.frame(date = c("2020-12-31"),
                 a_usd = c(1000),
                 b_usd = c(2000),
                 c_usd = c(3000),
                 a_eur = c(100),
                 b_eur = c(200),
                 c_eur = c(300),
                 a_gbp = c(10),
                 b_gbp = c(20),
                 c_gbp = c(30))
One way is to specify names_sep together with a names_to that contains the special ".value" sentinel in pivot_longer:

library(dplyr)
library(tidyr)

df %>%
  pivot_longer(cols = -date, names_to = c("grp", ".value"), names_sep = "_")
-output
# A tibble: 3 x 5
#  date       grp     usd   eur   gbp
#  <chr>      <chr> <dbl> <dbl> <dbl>
#1 2020-12-31 a      1000   100    10
#2 2020-12-31 b      2000   200    20
#3 2020-12-31 c      3000   300    30
A base R option using reshape
reshape(
  setNames(df, gsub("(\\w+)_(\\w+)", "\\2.\\1", names(df))),
  direction = "long",
  varying = -1
)
gives
    date       time  usd eur gbp id
1.a 2020-12-31 a    1000 100  10  1
1.b 2020-12-31 b    2000 200  20  1
1.c 2020-12-31 c    3000 300  30  1
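For completeness, data.table's melt() can do the same one-step split; a sketch (measure() needs a reasonably recent data.table, roughly 1.14.2 or later):

library(data.table)

# Split each name on "_": the first token becomes the grp column and the
# second token (the currency) names the value columns
melt(as.data.table(df),
     id.vars = "date",
     measure.vars = measure(grp, value.name, sep = "_"))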

Track changes for a given field given some ID and date

I have the following dataframe:
Each client's cap could be upgraded at some point in time, given by the Date column. I would like to aggregate on ID and show the date on which the cap has been upgraded; sometimes this could happen twice. The output should look like this:
Thank you in advance!
library(tidyverse)

df <- tibble(
  ID = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
  Cap = c("S", "S", "M", "S", "M", "L", "L", "S", "L"),
  Date = paste("01", c(1:2, 4, 3:6, 2:3), "2000") %>% lubridate::dmy()
)
df2 <- df %>%
  group_by(ID) %>% # looking at each ID separately
  mutate(prev = lag(Cap), # what is the row - 1 value
         change = !(Cap == prev)) %>% # is the row - 1 value different than the current row value
  filter(change) %>% # filtering where there are changes
  select(ID, "From" = prev, "To" = Cap, Date) # renaming columns and selecting the relevant ones
You can use the lag command here to create a column holding the previous row's value of Cap. Then you simply filter out first entries and rows where the value did not change.
out <- df %>%
  ## calculate lag within unique subjects
  group_by(ID) %>%
  mutate(
    ## copy previous row value to new column
    from = lag(Cap),
    to = Cap
  ) %>%
  ungroup() %>%
  ## ignore first entry for each subject
  drop_na(from) %>%
  ## ignore all rows where Cap didn't change
  filter(from != to) %>%
  ## reorder columns
  select(ID, from, to, Date)
This gives us output matching your expected format
> out
# A tibble: 4 x 4
     ID from  to    Date
  <dbl> <chr> <chr> <date>
1     1 S     M     2000-04-01
2     2 S     M     2000-04-01
3     2 M     L     2000-05-01
4     3 S     L     2000-03-01
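If you prefer data.table, the same lag logic carries over via shift(); a sketch, assuming the df tibble defined in the first answer:

library(data.table)

# Previous Cap within each ID, then keep only rows where it changed
# (filtering in i drops the NA first rows automatically)
setDT(df)[, from := shift(Cap), by = ID][
  Cap != from, .(ID, from, to = Cap, Date)]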
