generate rows for each id by date sequence - r

My dataframe looks something like this:
df <- read.table(text="
id start end
1 2 2018-10-01 2018-12-01
2 3 2018-01-01 2018-04-01
", header=TRUE)
What I am trying to achieve is to get the difference between start and end in months for each id, and then generate a new dataframe with a row for each month for that id. The result should be
result <- read.table(text="
id date
1 2 2018-10-01
2 2 2018-11-01
3 2 2018-12-01
4 3 2018-01-01
5 3 2018-02-01
6 3 2018-03-01
7 3 2018-04-01
", header=TRUE)

The most straightforward way using base R functions is to create a sequence of monthly dates for each row, build a dataframe for each, and rbind them together:
do.call(rbind, with(df, lapply(1:nrow(df), function(i)
  data.frame(id = id[i],
             date = seq(as.Date(start[i]), as.Date(end[i]), by = "month")))))
# id date
#1 2 2018-10-01
#2 2 2018-11-01
#3 2 2018-12-01
#4 3 2018-01-01
#5 3 2018-02-01
#6 3 2018-03-01
#7 3 2018-04-01

We can do this easily with Map. Pass the Date-converted 'start' and 'end' columns of the dataset as arguments to Map to get the sequence of months as a list, then expand 'id' based on the lengths of the list elements and concatenate the list elements to create the expanded dataframe:
lst1 <- Map(seq, MoreArgs = list(by = 'month'), as.Date(df$start), as.Date(df$end))
data.frame(id = rep(df$id, lengths(lst1)), date = do.call(c, lst1))
# id date
#1 2 2018-10-01
#2 2 2018-11-01
#3 2 2018-12-01
#4 3 2018-01-01
#5 3 2018-02-01
#6 3 2018-03-01
#7 3 2018-04-01
Or, using the tidyverse: mutate the 'start' and 'end' columns to Date class, use map2 (from purrr) to get the sequence of dates from 'start' to 'end' by month, and expand the data by unnesting:
library(tidyverse)
df %>%
  mutate_at(2:3, as.Date) %>%
  transmute(id = id, date = map2(start, end, ~ seq(.x, .y, by = 'month'))) %>%
  unnest(c(date))
# id date
#1 2 2018-10-01
#2 2 2018-11-01
#3 2 2018-12-01
#4 3 2018-01-01
#5 3 2018-02-01
#6 3 2018-03-01
#7 3 2018-04-01
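Since dplyr 1.1.0 there is also reframe(), which may return several rows per group; a minimal sketch, under the assumption that each id has a single start/end pair:
library(dplyr) # >= 1.1.0 for reframe()
df %>%
  mutate(across(start:end, as.Date)) %>%
  reframe(date = seq(start, end, by = "month"), .by = id)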


R number of rows in each day [duplicate]

I have a data frame with time windows on each row. The time window is identified by a start_date and end_date for each ID.
For each calendar day, I would like to know how many IDs have a time window spanning that day.
Example data
data <- data.frame(
id = c("A","B","C"),
start_date = as.POSIXct(c("2020-01-01 01:00:00", "2020-01-02 01:00:00", "2020-01-03 01:00:00")),
end_date = as.POSIXct(c("2020-01-04 01:00:00", "2020-01-03 01:00:00", "2020-01-06 01:00:00")),
stringsAsFactors = FALSE
)
data
id start_date end_date
1 A 2020-01-01 01:00:00 2020-01-04 01:00:00
2 B 2020-01-02 01:00:00 2020-01-03 01:00:00
3 C 2020-01-03 01:00:00 2020-01-06 01:00:00
The output I am looking for aggregates this into days, with the number of IDs present on each day:
day number_of_ids
2020-01-01 1
2020-01-02 2
2020-01-03 3
2020-01-04 2
2020-01-05 1
2020-01-06 1
Any help much appreciated.
We get the sequence of dates between the corresponding 'start_date' and 'end_date' in a list column, unnest the list column, then group by 'day' and get the number of distinct 'id's with n_distinct in summarise:
library(dplyr)
library(purrr)
library(tidyr)
data %>%
  transmute(id, day = map2(as.Date(start_date), as.Date(end_date),
                           ~ seq(.x, .y, by = 'day'))) %>%
  unnest(c(day)) %>%
  group_by(day) %>%
  summarise(number_of_ids = n_distinct(id))
# A tibble: 6 x 2
# day number_of_ids
# <date> <int>
#1 2020-01-01 1
#2 2020-01-02 2
#3 2020-01-03 3
#4 2020-01-04 2
#5 2020-01-05 1
#6 2020-01-06 1
In base R you could do:
a <- with(data, setNames(Map(function(x, y) format(seq(x, y, '1 day'), '%F'),
                             start_date, end_date), id))
aggregate(ind ~ values, stack(a), length)
values ind
1 2020-01-01 1
2 2020-01-02 2
3 2020-01-03 3
4 2020-01-04 2
5 2020-01-05 1
6 2020-01-06 1
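For comparison, a base R sketch that skips the expansion entirely and instead counts, for each calendar day in the overall range, how many windows span it (tz = "" keeps the local time zone when converting POSIXct to Date, avoiding a possible UTC day shift):
days <- seq(as.Date(min(data$start_date), tz = ""),
            as.Date(max(data$end_date), tz = ""), by = "day")
data.frame(day = days,
           number_of_ids = sapply(days, function(d)
             sum(as.Date(data$start_date, tz = "") <= d &
                 as.Date(data$end_date, tz = "") >= d)))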

R: Using dplyr to count number of occurrences 1 hour ahead

I was trying to figure out a way to use dplyr to count, for each id, the number of occurrences within 1 hour ahead of each time. I tried using a for loop, but it doesn't give me the desired result. I went through Stack Overflow and tried various methods, but to no avail. Any advice or help is greatly appreciated. Thanks
Dataset:
https://drive.google.com/file/d/1U186SeBWYyTnJVgUPmow7yknr6K9vu8i/view?usp=sharing
id date_time count
1 1 2019-12-27 00:00:00 NA
2 2 2019-12-27 00:00:00 NA
3 2 2019-12-27 00:55:00 NA
4 2 2019-12-27 01:00:00 NA
5 2 2019-12-28 01:00:00 NA
6 3 2019-12-27 22:00:00 NA
7 3 2019-12-27 22:31:00 NA
8 3 2019-12-28 14:32:00 NA
Desired Output
id date_time count
1 1 2019-12-27 00:00:00 1 #Count = 1 since there is no other cases 1 hour ahead but itself, only 1 case of id=1
2 2 2019-12-27 00:00:00 3 #Count = 3 as there are 3 cases from 00:00 to 01:00 on 27/12
3 2 2019-12-27 00:55:00 2 #Count = 2 as there are 2 cases from 00:55 to 01:55 on 27/12
4 2 2019-12-27 01:00:00 1 #Count = 1 as only itself from 01:00 to 02:00 on 27/12
5 2 2019-12-28 01:00:00 1 #Count = 1 as only itself from 01:00 to 02:00 on 28/12
6 3 2019-12-27 22:00:00 2
7 3 2019-12-27 22:31:00 1
8 3 2019-12-28 14:32:00 1
My code (I'm stuck):
library(tidyverse)
data <- read.csv('test.csv')
data$date_time <- as.POSIXct(data$date_time)
data$count <- NA
data %>%
  group_by(id) %>%
  arrange(date_time, .by_group = TRUE)
# Doesn't give the desired output
for (i in 1:nrow(data)) {
  data$count[i] <- nrow(data[data$date_time <= data$date_time[i] + 1*60*60 &
                             data$date_time >= data$date_time[i], ])
}
Here is an approach using data.table non-equi join:
DT[, onehrlater := date_time + 60*60]
DT[, count :=
     DT[DT, on = .(id, date_time >= date_time, date_time <= onehrlater),
        by = .EACHI, .N]$N]
How to read this:
1) DT[, onehrlater := date_time + 60*60] creates a new POSIXct column that is one hour later; := updates the original dataset by reference.
2) DT[DT, on=.(id, date_time>=date_time, date_time<=onehrlater)] performs a self non-equi join, so each row is joined to all rows with i) the same id, ii) a date_time at or after this row's date_time, and iii) a date_time at or before this row's date_time plus one hour.
3) by=.EACHI, .N returns the count for each of those joins; $N extracts that count, and DT[, count := ...] writes it back to the original dataset by reference.
output:
id date_time onehrlater count
1: 1 2019-12-27 00:00:00 2019-12-27 01:00:00 1
2: 2 2019-12-27 00:00:00 2019-12-27 01:00:00 3
3: 2 2019-12-27 00:55:00 2019-12-27 01:55:00 2
4: 2 2019-12-27 01:00:00 2019-12-27 02:00:00 1
5: 2 2019-12-28 01:00:00 2019-12-28 02:00:00 1
6: 3 2019-12-27 22:00:00 2019-12-27 23:00:00 2
7: 3 2019-12-27 22:31:00 2019-12-27 23:31:00 1
8: 3 2019-12-28 14:32:00 2019-12-28 15:32:00 1
data:
library(data.table)
DT <- fread("id date_time
1 2019-12-27T00:00:00
2 2019-12-27T00:00:00
2 2019-12-27T00:55:00
2 2019-12-27T01:00:00
2 2019-12-28T01:00:00
3 2019-12-27T22:00:00
3 2019-12-27T22:31:00
3 2019-12-28T14:32:00")
DT[, date_time := as.POSIXct(date_time, format="%Y-%m-%dT%T")]
The question can be solved using a non-equi self join (in data.table speak). At the time of writing this was not available in dplyr; since dplyr 1.1.0 it is, via join_by() (see the sketch below).
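A hedged dplyr sketch, assuming dplyr >= 1.1.0 and the dat object defined under Data below:
library(dplyr) # >= 1.1.0 for non-equi joins via join_by()
dat %>%
  mutate(one_hr_later = date_time + 60*60) %>%
  inner_join(dat,
             by = join_by(id, date_time <= date_time, one_hr_later >= date_time),
             suffix = c("", ".y")) %>%
  count(id, date_time, name = "count")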
Here is an implementation using SQL:
library(sqldf)
sqldf("
select d2.id, d2.date_time, count(d1.date_time) as count
from dat as d1, dat as d2
where d1.id = d2.id and d1.date_time between d2.date_time and (d2.date_time + 60*60)
group by d2.id, d2.date_time")
id date_time count
1 1 2019-12-27 00:00:00 1
2 2 2019-12-27 00:00:00 3
3 2 2019-12-27 00:55:00 2
4 2 2019-12-27 01:00:00 1
5 2 2019-12-28 01:00:00 1
6 3 2019-12-27 22:00:00 2
7 3 2019-12-27 22:31:00 1
8 3 2019-12-28 14:32:00 1
Data
# reading directly from google drive, see https://stackoverflow.com/a/33142446/3817004
dat <- data.table::fread(
"https://drive.google.com/uc?id=1U186SeBWYyTnJVgUPmow7yknr6K9vu8i&export=download")[
, date_time := anytime::anytime(date_time)]
Maybe fuzzyjoin might be helpful here. You can create a time range for each row of the data (setting end_time to 3600 seconds, i.e. 1 hour, after each date_time). Then you can fuzzy-join the data with itself, where a date_time falling within that range counts as within the hour.
library(tidyverse)
library(fuzzyjoin)
df %>%
mutate(row_id = row_number(),
end_time = date_time + 3600) %>%
fuzzy_inner_join(df,
by = c("id", "date_time" = "date_time", "end_time" = "date_time"),
match_fun = list(`==`, `<=`, `>=`)) %>%
group_by(row_id) %>%
summarise(id = first(id.x),
date_time = first(date_time.x),
count = n())
Output
# A tibble: 8 x 4
row_id id date_time count
<int> <int> <dttm> <int>
1 1 1 2019-12-27 00:00:00 1
2 2 2 2019-12-27 00:00:00 3
3 3 2 2019-12-27 00:55:00 2
4 4 2 2019-12-27 01:00:00 1
5 5 2 2019-12-28 01:00:00 1
6 6 3 2019-12-27 22:00:00 2
7 7 3 2019-12-27 22:31:00 1
8 8 3 2019-12-28 14:32:00 1
I'd probably just write a little helper function here along with the split-lapply-bind method rather than group_by:
f <- function(x)
{
sapply(1:nrow(x), function(i) {
y <- as.numeric(difftime(x$date_time, x$date_time[i], units = "min"))
sum(y >= 0 & y <= 60)
})
}
df %>% mutate(count = do.call(c, df %>% split(df$id) %>% lapply(f)))
#> id date_time count
#> 1 1 2019-12-27 00:00:00 1
#> 2 2 2019-12-27 00:00:00 3
#> 3 2 2019-12-27 00:55:00 2
#> 4 2 2019-12-27 01:00:00 1
#> 5 2 2019-12-28 01:00:00 1
#> 6 3 2019-12-27 22:00:00 2
#> 7 3 2019-12-27 22:31:00 1
#> 8 3 2019-12-28 14:32:00 1
I've split the data by id and then, for each row, calculated how many of the date-times from that row onward fall within a 1-hour range:
library(tidyverse) # tribble, map, map_dbl, pluck used below

my_data <- tribble(
~id, ~date_time,
1, "2019-12-27 00:00:00",
2, "2019-12-27 00:00:00",
2, "2019-12-27 00:55:00",
2, "2019-12-27 01:00:00",
2, "2019-12-28 01:00:00",
3, "2019-12-27 22:00:10",
3, "2019-12-27 22:31:00",
3, "2019-12-28 14:32:00"
)
my_data <- my_data %>%
  mutate(date_time = lubridate::ymd_hms(date_time)) %>%
  split(.$id) %>%
  # cumulative elapsed time in hours since the first row of each id
  map(~ .x %>% mutate(diff = cumsum(c(0, as.numeric(diff(date_time), units = "hours")))))
counts <- my_data %>%
  map(function(id_data)
    map_dbl(seq_len(nrow(id_data)), ~ {
      start_diff <- id_data %>%
        slice(.x) %>%
        pluck("diff")
      # rows from .x onward whose elapsed time is at most one hour later
      id_data[.x:nrow(id_data), ] %>%
        filter(diff - start_diff <= 1) %>%
        nrow()
    }))
my_data <- my_data %>%
  map2(counts, ~ .x %>% mutate(counts = .y)) %>%
  bind_rows() %>%
  select(-diff)
You just need to tweak the logic of your loop so that the subset for each id isn't overwritten while you iterate over its rows:
res <- data.frame() # empty df for results
for (i in unique(data$id)) {
  tmp <- data[data$id == i, ] # logic is on the id level
  for (r in 1:nrow(tmp)) {
    # logical test based on a 1 hour window ahead of row r
    win <- tmp[tmp$date_time >= tmp$date_time[r] &
               tmp$date_time <= tmp$date_time[r] + 3600, ]
    out <- tmp[r, ] # result is on the row level
    out$count <- nrow(win) # count
    res <- rbind(res, out) # populate results
  }
}
this yields:
> res
id date_time count
1 1 2019-12-27 00:00:00 1
2 2 2019-12-27 00:00:00 3
3 2 2019-12-27 00:55:00 2
4 2 2019-12-27 01:00:00 1
5 2 2019-12-28 01:00:00 1
6 3 2019-12-27 22:00:00 2
7 3 2019-12-27 22:31:00 1
8 3 2019-12-28 14:32:00 1
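The same windowed count can also be vectorised without the nested loop; a minimal base R sketch over the same data object:
data$count <- vapply(seq_len(nrow(data)), function(i)
  sum(data$id == data$id[i] &
      data$date_time >= data$date_time[i] &
      data$date_time <= data$date_time[i] + 3600),
  integer(1))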

Fixing the error associated with creating a sequence of dates between start date and end date in R

***NOTE: please do not link a similar post. I have found several other similar postings, but their responses have not resolved the errors I get: "Error in seq.int(0, to0 - from, by) : 'to' must be a finite number" or "'from' must be of length 1". I'm looking to understand why these errors occur and how to prevent them from occurring. Thanks!
I have a data frame like the following
id startdate enddate
1 01/01/2011 01/05/2011
1 02/03/2012 02/05/2012
2 03/04/2013 03/06/2013
3 04/06/2014 04/09/2014
I want to transform the data frame so as to create the following:
id date
1 01/01/2011
1 01/02/2011
1 01/03/2011
1 01/04/2011
1 01/05/2011
1 02/03/2012
1 02/04/2012
1 02/05/2012
2 03/04/2013
2 03/05/2013
2 03/06/2013
.... and so on to fill in the sequence of dates between startdate and enddate
I have tried the following:
one <- as.data.table(one)
one[, startdate := as.character(startdate)]
one[, enddate := as.character(enddate)]
one[, startdate := as.Date(startdate, format = "%m/%d/%Y")]
one[, enddate := as.Date(enddate, format = "%m/%d/%Y")]
one <- as.data.frame(one)
one %>%
  rowwise() %>%
  do(data.frame(id = .$id, date = seq(.$startdate, .$enddate, by = "day")))
When I run this, I get the following error: Error in seq.int(0, to0 - from, by) : 'to' must be a finite number
Why is this? And how can I fix this piece of code?
These errors come from NA dates: as.Date silently returns NA when the format doesn't match, and seq() then fails with a non-finite 'to' (while passing a whole column rather than a single value to seq() gives "'from' must be of length 1"). With data.table, we can use Map: convert 'startdate' and 'enddate' to Date class, use Map to get the sequence of corresponding elements, replicate 'id' based on the lengths of the list of dates, and concatenate the list of dates to create the two-column output:
library(data.table)
one[, {lst1 <- Map(seq, as.IDate(startdate, "%m/%d/%Y"),
as.IDate(enddate, "%m/%d/%Y"),
MoreArgs = list(by = "day"))
.(id = rep(id, lengths(lst1)), date = do.call(c, lst1))}]
# id date
# 1: 1 2011-01-01
# 2: 1 2011-01-02
# 3: 1 2011-01-03
# 4: 1 2011-01-04
# 5: 1 2011-01-05
# 6: 1 2012-02-03
# 7: 1 2012-02-04
# 8: 1 2012-02-05
# 9: 2 2013-03-04
#10: 2 2013-03-05
#11: 2 2013-03-06
#12: 3 2014-04-06
#13: 3 2014-04-07
#14: 3 2014-04-08
#15: 3 2014-04-09
If there are multiple formats in the date columns, one option is anydate from anytime, which automatically converts many common formats to Date class:
library(anytime)
one[, {lst1 <- Map(seq, anydate(startdate),
anydate(enddate),
MoreArgs = list(by = "day"))
.(id = rep(id, lengths(lst1)), date = do.call(c, lst1))}]
Or using tidyverse
library(dplyr)
library(purrr)
library(tidyr)
library(lubridate)
one %>%
transmute(id, date = map2(mdy(startdate), mdy(enddate), seq, by = 'day')) %>%
unnest(c(date))
# A tibble: 15 x 2
# id date
# <int> <date>
# 1 1 2011-01-01
# 2 1 2011-01-02
# 3 1 2011-01-03
# 4 1 2011-01-04
# 5 1 2011-01-05
# 6 1 2012-02-03
# 7 1 2012-02-04
# 8 1 2012-02-05
# 9 2 2013-03-04
#10 2 2013-03-05
#11 2 2013-03-06
#12 3 2014-04-06
#13 3 2014-04-07
#14 3 2014-04-08
#15 3 2014-04-09
data
one <- structure(list(id = c(1L, 1L, 2L, 3L), startdate = c("01/01/2011",
"02/03/2012", "03/04/2013", "04/06/2014"), enddate = c("01/05/2011",
"02/05/2012", "03/06/2013", "04/09/2014")), class = "data.frame", row.names = c(NA,
-4L))
setDT(one)

Find Consecutive Occurrences within a Group (of IDs)

My data set looks like this:
ID start.date end.date program
1 2016.05.05 2017.05.05 A
1 2017.05.06 2019.06.16 A
2 2012.06.05 2013.06.18 B
3 2014.09.09 2017.07.01 B
3 2017.09.09 2018.09.09 B
I want to identify the people who were present in a program (character variable) consecutively, and then calculate the time between each end.date and start.date (if the occurrence was consecutive).
So the resulting data should look like this:
ID start.date end.date   program days
1  2016.05.05 2017.05.05 A       NA
1  2017.05.06 2019.06.16 A       1
2  2012.06.05 2013.06.18 B       NA
3  2014.09.09 2017.07.01 B       NA
3  2017.09.09 2018.09.09 B       63
Don't know how to start on this!
library(dplyr)
dat %>%
  group_by(ID, program) %>%
  arrange(start.date) %>% # added in case the data isn't sorted
  mutate(days = start.date - lag(end.date))
I get slightly different results, though:
# A tibble: 5 x 5
# Groups: ID, program [3]
ID start.date end.date program days
<int> <date> <date> <chr> <time>
1 1 2016-05-05 2017-05-05 A NA
2 1 2017-05-06 2019-06-16 A 1
3 2 2012-06-05 2013-06-18 B NA
4 3 2014-09-09 2017-07-01 B NA
5 3 2017-09-09 2018-09-09 B 70
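If a plain numeric column is preferred over the <time> (difftime) class shown above, one small tweak is to wrap the difference in as.numeric() with explicit units:
dat %>%
  group_by(ID, program) %>%
  arrange(start.date) %>%
  mutate(days = as.numeric(start.date - lag(end.date), units = "days"))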
To bring the data in, I converted to dates:
dat <- read.table(header = T, stringsAsFactors = F,
text = "ID start.date end.date program
1 2016.05.05 2017.05.05 A
1 2017.05.06 2019.06.16 A
2 2012.06.05 2013.06.18 B
3 2014.09.09 2017.07.01 B
3 2017.09.09 2018.09.09 B") %>%
mutate_at(vars(matches("date")), lubridate::ymd)

summarize multiple dynamic columns and store results in new columns

I have the following situation.
library(lubridate) # needed below for months() arithmetic on dates

df <- rbind(
  data.frame(thisDate = seq(as.Date("2018-1-1"), as.Date("2018-1-2"), by = "day")),
  data.frame(thisDate = seq(as.Date("2018-2-1"), as.Date("2018-2-2"), by = "day")))
df <- cbind(df, lastMonth = as.Date(format(as.Date(df$thisDate - months(1)), "%Y-%m-01")))
df <- cbind(df, prod1Quantity = 1:4)
I have quantities for different days of a month for an unknown number of products. I want one column per product holding that product's total quantity for the whole previous month, i.e. grouped by lastMonth. I just don't get how to group by, mutate, and summarise dynamically, if that is indeed the right approach.
I came across data.table generate multiple columns and summarize them, which appears to do what I need, but I just don't get how it works!
Desired Output
thisDate lastMonth prod1Quantity prod1prevMonth
1 2018-01-01 2017-12-01 1 NA
2 2018-01-02 2017-12-01 2 NA
3 2018-02-01 2018-01-01 3 3
4 2018-02-02 2018-01-01 4 3
Another approach could be
library(dplyr)
library(lubridate)
temp_df <- df %>%
mutate(thisDate_forJoin = as.Date(format(thisDate,"%Y-%m-01")))
final_df <- temp_df %>%
mutate(thisDate_forJoin = thisDate_forJoin %m-% months(1)) %>%
left_join(temp_df %>%
group_by(thisDate_forJoin) %>%
summarise_if(is.numeric, sum),
by="thisDate_forJoin") %>%
select(-thisDate_forJoin)
Output is:
thisDate prod1Quantity.x prod2Quantity.x prod1Quantity.y prod2Quantity.y
1 2018-01-01 1 10 NA NA
2 2018-01-02 2 11 NA NA
3 2018-02-01 3 12 3 21
4 2018-02-02 4 13 3 21
Sample data:
df <- structure(list(thisDate = structure(c(17532, 17533, 17563, 17564
), class = "Date"), prod1Quantity = 1:4, prod2Quantity = 10:13), class = "data.frame", row.names = c(NA,
-4L))
# thisDate prod1Quantity prod2Quantity
#1 2018-01-01 1 10
#2 2018-01-02 2 11
#3 2018-02-01 3 12
#4 2018-02-02 4 13
A solution can be reached by calculating the month-wise production quantity and then joining on the month of lastMonth and thisDate. The lubridate::month function is used to extract the month from a date.
library(dplyr)
library(lubridate)
df %>% group_by(month = as.integer(month(thisDate))) %>%
summarise(prodQuantMonth = sum(prod1Quantity)) %>%
right_join(., mutate(df, prevMonth = month(lastMonth)), by=c("month" = "prevMonth")) %>%
select(thisDate, lastMonth, prod1Quantity, prodQuantLastMonth = prodQuantMonth)
# # A tibble: 4 x 4
# thisDate lastMonth prod1Quantity prodQuantLastMonth
# <date> <date> <int> <int>
# 1 2018-01-01 2017-12-01 1 NA
# 2 2018-01-02 2017-12-01 2 NA
# 3 2018-02-01 2018-01-01 3 3
# 4 2018-02-02 2018-01-01 4 3
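One caveat: grouping on month() alone would collapse the same calendar month from different years. If the data can span several years, a year-month key is safer; a hedged sketch using the question's df:
df %>%
  group_by(ym = format(thisDate, "%Y-%m")) %>%
  summarise(prodQuantMonth = sum(prod1Quantity)) %>%
  right_join(mutate(df, prevMonth = format(lastMonth, "%Y-%m")),
             by = c("ym" = "prevMonth")) %>%
  select(thisDate, lastMonth, prod1Quantity, prodQuantLastMonth = prodQuantMonth)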
