I have a data frame with a time window on each row. The window is identified by a start_date and end_date for each ID.
For each calendar day, I would like to know how many IDs have a time window spanning that day.
Example data
data <- data.frame(
id = c("A","B","C"),
start_date = as.POSIXct(c("2020-01-01 01:00:00", "2020-01-02 01:00:00", "2020-01-03 01:00:00")),
end_date = as.POSIXct(c("2020-01-04 01:00:00", "2020-01-03 01:00:00", "2020-01-06 01:00:00")),
stringsAsFactors = FALSE
)
data
id start_date end_date
1 A 2020-01-01 01:00:00 2020-01-04 01:00:00
2 B 2020-01-02 01:00:00 2020-01-03 01:00:00
3 C 2020-01-03 01:00:00 2020-01-06 01:00:00
The output I am looking for aggregates this into days, with the number of IDs present on each day.
day number_of_ids
2020-01-01 1
2020-01-02 2
2020-01-03 3
2020-01-04 2
2020-01-05 1
2020-01-06 1
Any help much appreciated.
We get the sequence of dates between each corresponding 'start_date' and 'end_date' in a list column, unnest the list column, then group by 'day' and count the distinct 'id' values with n_distinct in summarise:
library(dplyr)
library(purrr)
library(tidyr)
data %>%
transmute(id, day = map2(as.Date(start_date), as.Date(end_date),
~ seq(.x, .y, by = 'day'))) %>%
unnest(c(day)) %>%
group_by(day) %>%
summarise(number_of_ids = n_distinct(id))
# A tibble: 6 x 2
# day number_of_ids
# <date> <int>
#1 2020-01-01 1
#2 2020-01-02 2
#3 2020-01-03 3
#4 2020-01-04 2
#5 2020-01-05 1
#6 2020-01-06 1
In base R you could do:
a <- with(data, setNames(Map(function(x, y) format(seq(x, y, '1 day'), '%F'), start_date, end_date), id))
aggregate(ind ~ values, stack(a), length)
values ind
1 2020-01-01 1
2 2020-01-02 2
3 2020-01-03 3
4 2020-01-04 2
5 2020-01-05 1
6 2020-01-06 1
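For reference, here is what stack() does to the named list (a small illustration of my own, not part of the original answer): it flattens the list into a data frame with a values column (the dates) and an ind column (the ids), which aggregate() then counts per value.
a <- list(A = c("2020-01-01", "2020-01-02"), B = "2020-01-02")
stack(a)
#       values ind
# 1 2020-01-01   A
# 2 2020-01-02   A
# 3 2020-01-02   B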
I wish to calculate the intervals between dates. The differences in days should take weekends into account. I have over 200 date stamps.
For example, the currently displayed time difference between the 6th (Wednesday) and the 11th (Monday) of January is 5 days. I would like to obtain 3 days.
I managed to get a solution without excluding Saturday and Sunday using the following code and the packages lubridate and dplyr.
Could you please guide me on how to exclude weekends from the calculation?
Thank you.
library(lubridate)
library(dplyr)
dates <- c("2021-01-01", "2021-01-04", "2021-01-05", "2021-01-06", "2021-01-11", "2021-01-13", "2021-01-14", "2021-01-18", "2021-01-25", "2021-01-29")
d <- do.call(rbind, lapply(dates, as.data.frame))
dateoverview <- rename(d, Dates = 1)
dateoverview$Dates <- lubridate::ymd(dateoverview$Dates)
datecalculation <- dateoverview %>%
mutate(Days = Dates - lag(Dates)) %>%
mutate(Weekday = wday(Dates, label = FALSE))
datecalculation
## Dates Days Weekday
## 1 2021-01-01 NA days 6
## 2 2021-01-04 3 days 2
## 3 2021-01-05 1 days 3
## 4 2021-01-06 1 days 4
## 5 2021-01-11 5 days 2
## 6 2021-01-13 2 days 4
## 7 2021-01-14 1 days 5
## 8 2021-01-18 4 days 2
## 9 2021-01-25 7 days 2
## 10 2021-01-29 4 days 6
There is probably a function somewhere that already does this, but here is a custom one that calculates the date difference excluding weekends.
library(dplyr)
library(purrr)
date_diff_excluding_weekends <- function(x, y) {
  if (is.na(x) || is.na(y)) return(NA)
  sum(!format(seq(x, y - 1, by = '1 day'), '%u') %in% 6:7)
}
datecalculation %>%
  mutate(Days = map2_dbl(lag(Dates), Dates, date_diff_excluding_weekends))
# Dates Days Weekday
#1 2021-01-01 NA 6
#2 2021-01-04 1 2
#3 2021-01-05 1 3
#4 2021-01-06 1 4
#5 2021-01-11 3 2
#6 2021-01-13 2 4
#7 2021-01-14 1 5
#8 2021-01-18 2 2
#9 2021-01-25 5 2
#10 2021-01-29 4 6
seq(x, y - 1, by = '1 day') creates a sequence of dates from the previous date up to one day before the current date.
format(..., "%u") returns the day of the week as a number: 1 is Monday, 7 is Sunday.
sum(!format(...) %in% 6:7) then counts how many days in that sequence fall on weekdays.
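To sanity-check the logic on the interval discussed above (my illustration, not part of the original answer):
x <- as.Date("2021-01-06")  # Wednesday
y <- as.Date("2021-01-11")  # Monday
format(seq(x, y - 1, by = '1 day'), '%u')
# "3" "4" "5" "6" "7"  (Wed, Thu, Fri, Sat, Sun)
sum(!format(seq(x, y - 1, by = '1 day'), '%u') %in% 6:7)
# 3                    (Saturday and Sunday drop out)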
Another possible solution:
library(dplyr)
library(lubridate)
# sample data
df = data.frame(Dates = seq(ymd('2021-01-01'), ymd('2021-12-31'), by = 'days'))
df_weekdays = df %>% filter(!weekdays(Dates) %in% c('Saturday', 'Sunday'))
# Application to your data (note: weekdays() output is locale-dependent)
datecalculation = datecalculation %>%
  filter(!weekdays(Dates) %in% c('Saturday', 'Sunday'))
I was trying to figure out a way to use dplyr to count, for each id, the number of occurrences within 1 hour ahead of each time. I tried using a for loop, but it doesn't give me the desired result. I went through Stack Overflow and tried various methods, but to no avail. Any advice or help is greatly appreciated. Thanks
Dataset:
https://drive.google.com/file/d/1U186SeBWYyTnJVgUPmow7yknr6K9vu8i/view?usp=sharing
id date_time count
1 1 2019-12-27 00:00:00 NA
2 2 2019-12-27 00:00:00 NA
3 2 2019-12-27 00:55:00 NA
4 2 2019-12-27 01:00:00 NA
5 2 2019-12-28 01:00:00 NA
6 3 2019-12-27 22:00:00 NA
7 3 2019-12-27 22:31:00 NA
8 3 2019-12-28 14:32:00 NA
Desired Output
id date_time count
1 1 2019-12-27 00:00:00 1 #Count = 1 since there are no other cases 1 hour ahead but itself; only 1 case of id=1
2 2 2019-12-27 00:00:00 3 #Count = 3 as there are 3 cases from 00:00 to 01:00 on 27/12
3 2 2019-12-27 00:55:00 2 #Count = 2 as there are 2 cases from 00:55 to 01:55 on 27/12
4 2 2019-12-27 01:00:00 1 #Count = 1 as only itself from 01:00 to 02:00 on 27/12
5 2 2019-12-28 01:00:00 1 #Count = 1 as only itself from 01:00 to 02:00 on 28/12
6 3 2019-12-27 22:00:00 2
7 3 2019-12-27 22:31:00 1
8 3 2019-12-28 14:32:00 1
My code (I'm stuck):
library(tidyverse)
data <- read.csv('test.csv')
data$date_time <- as.POSIXct(data$date_time)
data$count <- NA
data %>%
group_by(id) %>%
arrange(date_time, .by_group=TRUE)
#Doesn't give the desired output
for (i in 1:nrow(data)){
data$count[i] <- nrow(data[data$date_time<=data$date_time[i]+1*60*60 & data$date_time>=data$date_time[i],])
}
Here is an approach using a data.table non-equi join:
DT[, onehrlater := date_time + 60*60]
DT[, count :=
DT[DT, on=.(id, date_time>=date_time, date_time<=onehrlater),
by=.EACHI, .N]$N
]
How to read this:
1) DT[, onehrlater := date_time + 60*60] creates a new POSIXct column that is one hour later; := updates the original dataset by reference.
2) DT[DT, on=.(id, date_time>=date_time, date_time<=onehrlater)] performs a self non-equi join: every row with i) the same id, ii) a date_time at or after this row's date_time, and iii) a date_time at or before this row's date_time plus one hour is joined to this row.
3) by=.EACHI, .N returns the count of matches for each of those rows; $N extracts that count, and DT[, count := ...] writes it back to the original dataset by reference.
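A tiny standalone illustration of by=.EACHI (my example, not from the original answer): for each row of the inner table i, .N counts its matches in x.
library(data.table)
x <- data.table(id = c(1, 1, 2), v = 1:3)
i <- data.table(id = 1:2)
x[i, .N, on = 'id', by = .EACHI]
#    id N
# 1:  1 2
# 2:  2 1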
output:
id date_time onehrlater count
1: 1 2019-12-27 00:00:00 2019-12-27 01:00:00 1
2: 2 2019-12-27 00:00:00 2019-12-27 01:00:00 3
3: 2 2019-12-27 00:55:00 2019-12-27 01:55:00 2
4: 2 2019-12-27 01:00:00 2019-12-27 02:00:00 1
5: 2 2019-12-28 01:00:00 2019-12-28 02:00:00 1
6: 3 2019-12-27 22:00:00 2019-12-27 23:00:00 2
7: 3 2019-12-27 22:31:00 2019-12-27 23:31:00 1
8: 3 2019-12-28 14:32:00 2019-12-28 15:32:00 1
data:
library(data.table)
DT <- fread("id date_time
1 2019-12-27T00:00:00
2 2019-12-27T00:00:00
2 2019-12-27T00:55:00
2 2019-12-27T01:00:00
2 2019-12-28T01:00:00
3 2019-12-27T22:00:00
3 2019-12-27T22:31:00
3 2019-12-28T14:32:00")
DT[, date_time := as.POSIXct(date_time, format="%Y-%m-%dT%T")]
The question can be solved using a non-equi self join (in data.table speak). Unfortunately, at the time of writing this was not available in dplyr (it has since arrived in dplyr 1.1.0 via join_by(); see the sketch after the SQL version below).
Here is an implementation using SQL:
library(sqldf)
sqldf("
select d2.id, d2.date_time, count(d1.date_time) as count
from dat as d1, dat as d2
where d1.id = d2.id and d1.date_time between d2.date_time and (d2.date_time + 60*60)
group by d2.id, d2.date_time")
id date_time count
1 1 2019-12-27 00:00:00 1
2 2 2019-12-27 00:00:00 3
3 2 2019-12-27 00:55:00 2
4 2 2019-12-27 01:00:00 1
5 2 2019-12-28 01:00:00 1
6 3 2019-12-27 22:00:00 2
7 3 2019-12-27 22:31:00 1
8 3 2019-12-28 14:32:00 1
Data
# reading directly from google drive, see https://stackoverflow.com/a/33142446/3817004
dat <- data.table::fread(
"https://drive.google.com/uc?id=1U186SeBWYyTnJVgUPmow7yknr6K9vu8i&export=download")[
, date_time := anytime::anytime(date_time)]
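For completeness, here is a hedged sketch of the same non-equi self join in dplyr (>= 1.1.0, which added join_by()); my addition, assuming the same dat as above and no duplicated (id, date_time) pairs:
library(dplyr)
dat %>%
  mutate(onehrlater = date_time + 60*60) %>%
  inner_join(dat, by = join_by(id, date_time <= date_time, onehrlater >= date_time),
             suffix = c("", ".y")) %>%
  count(id, date_time, name = "count")  # one group per original row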
fuzzyjoin might be helpful here. You can create a time range for each row of the data (setting end_time to 3600 seconds, i.e. 1 hour, after each time), then fuzzy-join the data with itself, counting a row whenever its date_time falls within that range.
library(tidyverse)
library(fuzzyjoin)
df %>%
mutate(row_id = row_number(),
end_time = date_time + 3600) %>%
fuzzy_inner_join(df,
by = c("id", "date_time" = "date_time", "end_time" = "date_time"),
match_fun = list(`==`, `<=`, `>=`)) %>%
group_by(row_id) %>%
summarise(id = first(id.x),
date_time = first(date_time.x),
count = n())
Output
# A tibble: 8 x 4
row_id id date_time count
<int> <int> <dttm> <int>
1 1 1 2019-12-27 00:00:00 1
2 2 2 2019-12-27 00:00:00 3
3 3 2 2019-12-27 00:55:00 2
4 4 2 2019-12-27 01:00:00 1
5 5 2 2019-12-28 01:00:00 1
6 6 3 2019-12-27 22:00:00 2
7 7 3 2019-12-27 22:31:00 1
8 8 3 2019-12-28 14:32:00 1
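A note on the call (my reading, for clarity): the by and match_fun arguments pair up positionally, so the join conditions are id == id, date_time <= date_time, and end_time >= date_time, which is exactly the one-hour window; row_id keeps each original row as its own group for counting.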
I'd probably just write a little helper function here along with the split-lapply-bind method rather than group_by:
f <- function(x)
{
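  # for each row i of one id's data, count the rows falling within 60 minutes at or after it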
sapply(1:nrow(x), function(i) {
y <- as.numeric(difftime(x$date_time, x$date_time[i], units = "min"))
sum(y >= 0 & y <= 60)
})
}
df %>% mutate(count = do.call(c, df %>% split(df$id) %>% lapply(f)))
#> id date_time count
#> 1 1 2019-12-27 00:00:00 1
#> 2 2 2019-12-27 00:00:00 3
#> 3 2 2019-12-27 00:55:00 2
#> 4 2 2019-12-27 01:00:00 1
#> 5 2 2019-12-28 01:00:00 1
#> 6 3 2019-12-27 22:00:00 2
#> 7 3 2019-12-27 22:31:00 1
#> 8 3 2019-12-28 14:32:00 1
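One caveat worth adding (my note, not the answerer's): do.call(c, ...) pastes the per-id results back in split() order, so this assumes df is already sorted by id; if it is not, sort first:
df <- df[order(df$id, df$date_time), ]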
I've split the data by id, and then for each row I've calculated how many of the date-times that come after the selected row fall within a 1-hour range:
library(tidyverse)
my_data <- tribble(
~id, ~date_time,
1, "2019-12-27 00:00:00",
2, "2019-12-27 00:00:00",
2, "2019-12-27 00:55:00",
2, "2019-12-27 01:00:00",
2, "2019-12-28 01:00:00",
3, "2019-12-27 22:00:10",
3, "2019-12-27 22:31:00",
3, "2019-12-28 14:32:00"
)
my_data <- my_data %>%
mutate(
date_time = lubridate::ymd_hms(date_time)
) %>%
split(.$id) %>%
map(~.x %>% mutate(diff = c(0, diff(date_time)) / 60))
counts <- my_data %>%
map(function(id_data)
map_dbl(seq_len(nrow(id_data)),
~{
start_diff <- id_data %>%
slice(.x) %>%
pluck("diff")
id_data[.x:nrow(id_data),] %>%
filter(diff - start_diff < 1) %>%
nrow()
}
)
)
my_data <- my_data %>%
map2(counts, ~.x %>% mutate(counts = .y)) %>%
bind_rows() %>%
select(-diff)
You just need to tweak the logic of your loop:
res <- data.frame()                # empty df for results
for (i in unique(data$id)) {
  tmp <- data[data$id == i, ]      # logic is on the id level
  for (r in 1:nrow(tmp)) {
    # logical test based on a 1-hour window starting at row r
    in_window <- tmp$date_time >= tmp$date_time[r] &
      tmp$date_time <= tmp$date_time[r] + 3600
    out <- tmp[r, ]                # result is on the row level
    out$count <- sum(in_window)    # count
    res <- rbind(res, out)         # populate results
  }
}
this yields:
> res
  id           date_time count
1  1 2019-12-27 00:00:00     1
2  2 2019-12-27 00:00:00     3
3  2 2019-12-27 00:55:00     2
4  2 2019-12-27 01:00:00     1
5  2 2019-12-28 01:00:00     1
6  3 2019-12-27 22:00:00     2
7  3 2019-12-27 22:31:00     1
8  3 2019-12-28 14:32:00     1
Here is an example of the ideal input data for time series analysis: long format with one row per site and timestamp, i.e. columns site, DateTime, value (shown as a screenshot in the original question).
However, I receive the raw data like this:
raw_data <- data.frame(matrix(nrow=4, ncol=5))
colnames(raw_data) <- c("site","date","00:00","01:00","02:00")
raw_data$site <- c("A","B","A","B")
raw_data$date <- c("2015-01-01","2015-01-01","2015-01-02","2015-01-02")
raw_data$`00:00` <- c(1,4,1,4)
raw_data$`01:00` <- c(2,5,2,5)
raw_data$`02:00` <- c(3,6,3,6)
I have spent a lot of time trying to rearrange the raw data into the ideal structure. Any help is really appreciated. Thanks.
We can use pivot_longer to reshape to 'long' format, then unite the date and time columns and parse the result with ymd_hm:
library(dplyr)
library(tidyr)
library(lubridate)
raw_data %>%
pivot_longer(cols = matches('^[0-9]'), names_to = 'Time') %>%
unite(DateTime, date, Time, sep=" ") %>%
mutate(DateTime = ymd_hm(DateTime))
# A tibble: 12 x 3
# site DateTime value
# <chr> <dttm> <dbl>
# 1 A 2015-01-01 00:00:00 1
# 2 A 2015-01-01 01:00:00 2
# 3 A 2015-01-01 02:00:00 3
# 4 B 2015-01-01 00:00:00 4
# 5 B 2015-01-01 01:00:00 5
# 6 B 2015-01-01 02:00:00 6
# 7 A 2015-01-02 00:00:00 1
# 8 A 2015-01-02 01:00:00 2
# 9 A 2015-01-02 02:00:00 3
#10 B 2015-01-02 00:00:00 4
#11 B 2015-01-02 01:00:00 5
#12 B 2015-01-02 02:00:00 6
You can do this using melt from the data.table package:
library(data.table)
# Mark the data as a data.table
setDT(raw_data)
# Melt it into long format
new_data <- melt(raw_data, id.vars=c('site', 'date'), variable.name='time')
# Put date and time together into a new column, and delete the old ones
new_data[, `:=`(DateTime = paste(date, time),
date = NULL, time = NULL)]
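The pasted DateTime column is still character at this point; if a true date-time class is needed downstream (my addition, mirroring the ymd_hm step above), convert it:
new_data[, DateTime := as.POSIXct(DateTime, format = '%Y-%m-%d %H:%M')]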
df is my current dataset, and I want to insert dates from 1st Jan 2020 to 4th Jan 2020 for all possible locations.
df<-data.frame(location=c("x","x","y"),date=c("2020-01-01","2020-01-04","2020-01-03"))
This is what my expected dataset looks like:
expected_df<-data.frame(location=c("x","x","x","x","y","y","y","y"),date=c("2020-01-01","2020-01-02","2020-01-03","2020-01-04","2020-01-01","2020-01-02","2020-01-03","2020-01-04"))
location date
1 x 2020-01-01
2 x 2020-01-02
3 x 2020-01-03
4 x 2020-01-04
5 y 2020-01-01
6 y 2020-01-02
7 y 2020-01-03
8 y 2020-01-04
We can use complete from tidyr
library(dplyr)
library(tidyr)
start <- as.Date('2020-01-01')
end <- as.Date('2020-01-04')
df %>%
mutate(date = as.Date(date)) %>%
complete(location, date = seq(start, end, by = "1 day"))
# location date
# <fct> <date>
#1 x 2020-01-01
#2 x 2020-01-02
#3 x 2020-01-03
#4 x 2020-01-04
#5 y 2020-01-01
#6 y 2020-01-02
#7 y 2020-01-03
#8 y 2020-01-04
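If df also carried a measurement column, complete() could fill the inserted rows through its fill argument; a hypothetical sketch with an invented sales column, not from the original answer:
df %>%
  mutate(date = as.Date(date)) %>%
  complete(location, date = seq(start, end, by = "1 day"),
           fill = list(sales = 0))  # 'sales' is hypothetical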
It is essential to set stringsAsFactors = FALSE in your data frame so those character values do not get transformed into factors.
df <- data.frame(location = c("x","x","y"), date = c("2020-01-01","2020-01-04","2020-01-03"), stringsAsFactors = FALSE)
out <- expand.grid(
  date = seq.Date(from = min(as.Date(df$date)), to = max(as.Date(df$date)), by = "day"),
  location = unique(df$location)
)
out[, c("location", "date")]  # reorder columns to location, date
Output
location date
1 x 2020-01-01
2 x 2020-01-02
3 x 2020-01-03
4 x 2020-01-04
5 y 2020-01-01
6 y 2020-01-02
7 y 2020-01-03
8 y 2020-01-04
My dataframe looks something like this:
df <- read.table(text="
id start end
1 2 2018-10-01 2018-12-01
2 3 2018-01-01 2018-04-01
", header=TRUE)
What I am trying to achieve is to get the difference between start and end in months for each id, and then generate a new dataframe with one row per month for each id. The result should be:
result <- read.table(text="
id date
1 2 2018-10-01
2 2 2018-11-01
3 2 2018-12-01
4 3 2018-01-01
5 3 2018-02-01
6 3 2018-03-01
7 3 2018-04-01
", header=TRUE)
The most straightforward way using base R functions is to create a sequence of monthly dates for each row, build a dataframe from each, and rbind them together:
do.call(rbind, with(df,lapply(1:nrow(df), function(i)
data.frame(id = id[i], date = seq(as.Date(start[i]), as.Date(end[i]), by = "month")))))
# id date
#1 2 2018-10-01
#2 2 2018-11-01
#3 2 2018-12-01
#4 3 2018-01-01
#5 3 2018-02-01
#6 3 2018-03-01
#7 3 2018-04-01
We can do this easily with Map. Pass the Date-converted 'start' and 'end' columns of the dataset as arguments to Map, get the sequences of months as a list, then expand 'id' by the lengths of the list elements and concatenate them to create the expanded dataframe:
lst1 <- Map(seq, MoreArgs = list(by = 'month'), as.Date(df$start), as.Date(df$end))
data.frame(id = rep(df$id, lengths(lst1)), date = do.call(c, lst1))
# id date
#1 2 2018-10-01
#2 2 2018-11-01
#3 2 2018-12-01
#4 3 2018-01-01
#5 3 2018-02-01
#6 3 2018-03-01
#7 3 2018-04-01
Or using tidyverse: we convert the 'start' and 'end' columns to Date class, use map2 (from purrr) to get the sequence of dates from 'start' to 'end' by month, and expand the data by unnesting:
library(tidyverse)
df %>%
mutate_at(2:3, as.Date) %>%
transmute(id = id, date = map2(start, end, ~ seq(.x, .y, by = 'month'))) %>%
unnest(date)
# id date
#1 2 2018-10-01
#2 2 2018-11-01
#3 2 2018-12-01
#4 3 2018-01-01
#5 3 2018-02-01
#6 3 2018-03-01
#7 3 2018-04-01