I was trying to figure out a way to use dplyr to count, for each id, the number of occurrences within 1 hour ahead of each time. I tried a for loop, but it doesn't give me the desired result. I went through Stack Overflow and tried various methods, but to no avail. Any advice or help is greatly appreciated. Thanks
Dataset:
https://drive.google.com/file/d/1U186SeBWYyTnJVgUPmow7yknr6K9vu8i/view?usp=sharing
id date_time count
1 1 2019-12-27 00:00:00 NA
2 2 2019-12-27 00:00:00 NA
3 2 2019-12-27 00:55:00 NA
4 2 2019-12-27 01:00:00 NA
5 2 2019-12-28 01:00:00 NA
6 3 2019-12-27 22:00:00 NA
7 3 2019-12-27 22:31:00 NA
8 3 2019-12-28 14:32:00 NA
Desired Output
id date_time count
1 1 2019-12-27 00:00:00 1 # Count = 1 since there are no other cases within 1 hour ahead of itself; only 1 case of id=1
2 2 2019-12-27 00:00:00 3 # Count = 3 as there are 3 cases from 00:00 to 01:00 on 27/12
3 2 2019-12-27 00:55:00 2 # Count = 2 as there are 2 cases from 00:55 to 01:55 on 27/12
4 2 2019-12-27 01:00:00 1 # Count = 1 as it is the only case from 01:00 to 02:00 on 27/12
5 2 2019-12-28 01:00:00 1 # Count = 1 as it is the only case from 01:00 to 02:00 on 28/12
6 3 2019-12-27 22:00:00 2
7 3 2019-12-27 22:31:00 1
8 3 2019-12-28 14:32:00 1
My code (I'm stuck):
library(tidyverse)

data <- read.csv('test.csv')
data$date_time <- as.POSIXct(data$date_time)
data$count <- NA

data %>%
  group_by(id) %>%
  arrange(date_time, .by_group = TRUE)

# Doesn't give the desired output
for (i in 1:nrow(data)) {
  data$count[i] <- nrow(data[data$date_time <= data$date_time[i] + 1*60*60 &
                             data$date_time >= data$date_time[i], ])
}
If the OP is only looking for a tidyverse solution, I am happy to delete this.
Here is an approach using a data.table non-equi join:
DT[, onehrlater := date_time + 60*60]
DT[, count :=
DT[DT, on=.(id, date_time>=date_time, date_time<=onehrlater),
by=.EACHI, .N]$N
]
How to read this:
1) DT[, onehrlater := date_time + 60*60] creates a new POSIXct column that is one hour later; := updates the original dataset by reference.
2) DT[DT, on=.(id, date_time>=date_time, date_time<=onehrlater)] performs a self non-equi join, such that all rows with i) the same id, ii) a date_time at or after this row's date_time, and iii) a date_time at or before one hour after this row's date_time are joined to this row.
3) by=.EACHI, .N returns the number of matches for each of those rows, $N extracts that count vector, and DT[, count := ...] updates the original dataset by reference.
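To see what the inner self join produces before the counts are attached back, you can run it on its own (reusing DT and the onehrlater column created above):
DT[DT, on=.(id, date_time>=date_time, date_time<=onehrlater),
   by=.EACHI, .N]
Each row of the result pairs a row of the inner DT with the number of matches (N) it found.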
output:
id date_time onehrlater count
1: 1 2019-12-27 00:00:00 2019-12-27 01:00:00 1
2: 2 2019-12-27 00:00:00 2019-12-27 01:00:00 3
3: 2 2019-12-27 00:55:00 2019-12-27 01:55:00 2
4: 2 2019-12-27 01:00:00 2019-12-27 02:00:00 1
5: 2 2019-12-28 01:00:00 2019-12-28 02:00:00 1
6: 3 2019-12-27 22:00:00 2019-12-27 23:00:00 2
7: 3 2019-12-27 22:31:00 2019-12-27 23:31:00 1
8: 3 2019-12-28 14:32:00 2019-12-28 15:32:00 1
data:
library(data.table)
DT <- fread("id date_time
1 2019-12-27T00:00:00
2 2019-12-27T00:00:00
2 2019-12-27T00:55:00
2 2019-12-27T01:00:00
2 2019-12-28T01:00:00
3 2019-12-27T22:00:00
3 2019-12-27T22:31:00
3 2019-12-28T14:32:00")
DT[, date_time := as.POSIXct(date_time, format="%Y-%m-%dT%T")]
The question can be solved using a non-equi self join (in data.table speak). Unfortunately, at the time of writing this was not available in dplyr, AFAIK.
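(Aside: dplyr 1.1.0 later added non-equi joins via join_by(). A minimal sketch, assuming dplyr >= 1.1.0 and the dat object defined below:
library(dplyr)  # sketch only; assumes dplyr >= 1.1.0 for join_by() inequalities
dat %>%
  mutate(onehrlater = date_time + 3600) %>%
  left_join(dat,
            by = join_by(id, date_time <= date_time, onehrlater >= date_time),
            suffix = c("", ".matched")) %>%
  summarise(count = n(), .by = c(id, date_time))
Every row matches at least itself, so counting the joined rows per original row gives the same counts as the data.table version.)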
Here is an implementation using SQL:
library(sqldf)
sqldf("
select d2.id, d2.date_time, count(*) as count
from dat as d1, dat as d2
where d1.id = d2.id and d1.date_time between d2.date_time and (d2.date_time + 60*60)
group by d2.id, d2.date_time")
id date_time count
1 1 2019-12-27 00:00:00 1
2 2 2019-12-27 00:00:00 3
3 2 2019-12-27 00:55:00 2
4 2 2019-12-27 01:00:00 1
5 2 2019-12-28 01:00:00 1
6 3 2019-12-27 22:00:00 2
7 3 2019-12-27 22:31:00 1
8 3 2019-12-28 14:32:00 1
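(This works because sqldf hands POSIXct columns to SQLite as numeric seconds, so d2.date_time + 60*60 shifts the timestamp by exactly one hour.)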
Data
# reading directly from google drive, see https://stackoverflow.com/a/33142446/3817004
dat <- data.table::fread(
"https://drive.google.com/uc?id=1U186SeBWYyTnJVgUPmow7yknr6K9vu8i&export=download")[
, date_time := anytime::anytime(date_time)]
The fuzzyjoin package might be helpful here. You can create a time range for each row of the data (setting end_time to 3600 seconds, i.e. 1 hour, after each time). Then you can fuzzy-join the data with itself, matching rows whose date_time falls within that range, i.e. within the hour.
library(tidyverse)
library(fuzzyjoin)
df %>%
mutate(row_id = row_number(),
end_time = date_time + 3600) %>%
fuzzy_inner_join(df,
by = c("id", "date_time" = "date_time", "end_time" = "date_time"),
match_fun = list(`==`, `<=`, `>=`)) %>%
group_by(row_id) %>%
summarise(id = first(id.x),
date_time = first(date_time.x),
count = n())
Output
# A tibble: 8 x 4
row_id id date_time count
<int> <int> <dttm> <int>
1 1 1 2019-12-27 00:00:00 1
2 2 2 2019-12-27 00:00:00 3
3 3 2 2019-12-27 00:55:00 2
4 4 2 2019-12-27 01:00:00 1
5 5 2 2019-12-28 01:00:00 1
6 6 3 2019-12-27 22:00:00 2
7 7 3 2019-12-27 22:31:00 1
8 8 3 2019-12-28 14:32:00 1
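One design note: fuzzy_inner_join materializes every matching pair of rows before the per-row count is taken, so this approach is best suited to modestly sized data.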
I'd probably just write a little helper function here along with the split-lapply-bind method rather than group_by:
f <- function(x)
{
sapply(1:nrow(x), function(i) {
y <- as.numeric(difftime(x$date_time, x$date_time[i], units = "min"))
sum(y >= 0 & y <= 60)
})
}
df %>% mutate(count = do.call(c, df %>% split(df$id) %>% lapply(f)))
#> id date_time count
#> 1 1 2019-12-27 00:00:00 1
#> 2 2 2019-12-27 00:00:00 3
#> 3 2 2019-12-27 00:55:00 2
#> 4 2 2019-12-27 01:00:00 1
#> 5 2 2019-12-28 01:00:00 1
#> 6 3 2019-12-27 22:00:00 2
#> 7 3 2019-12-27 22:31:00 1
#> 8 3 2019-12-28 14:32:00 1
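(One caveat: this relies on df already being sorted by id, since split() returns the groups in sorted id order and the counts are concatenated back positionally.)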
I've split the data by id, and then for each row I've calculated how many of the date-times from that row onward fall within 1 hour of it:
my_data <- tribble(
~id, ~date_time,
1, "2019-12-27 00:00:00",
2, "2019-12-27 00:00:00",
2, "2019-12-27 00:55:00",
2, "2019-12-27 01:00:00",
2, "2019-12-28 01:00:00",
3, "2019-12-27 22:00:10",
3, "2019-12-27 22:31:00",
3, "2019-12-28 14:32:00"
)
my_data <- my_data %>%
mutate(
date_time = lubridate::ymd_hms(date_time)
) %>%
split(.$id) %>%
map(~.x %>% mutate(diff = as.numeric(difftime(date_time, date_time[1], units = "hours")))) # hours since this id's first observation
counts <- my_data %>%
map(function(id_data)
map_dbl(seq_len(nrow(id_data)),
~{
start_diff <- id_data %>%
slice(.x) %>%
pluck("diff")
id_data[.x:nrow(id_data), ] %>%
  filter(diff - start_diff <= 1) %>% # rows within one hour (inclusive) of row .x
  nrow()
}
)
)
my_data <- my_data %>%
map2(counts, ~.x %>% mutate(counts = .y)) %>%
bind_rows() %>%
select(-diff)
You just need to tweak the logic of your loop so that the 1-hour window test runs within each id:
res <- data.frame() # empty df for results
for (i in unique(data$id)) {
  tmp <- data[data$id == i, ] # logic is on the id level
  for (r in 1:nrow(tmp)) {
    # count rows in the 1-hour window starting at row r
    tmp$count[r] <- sum(tmp$date_time >= tmp$date_time[r] &
                        tmp$date_time <= tmp$date_time[r] + 3600)
  }
  res <- rbind(res, tmp) # populate results
}
this yields:
> res
  id           date_time count
1  1 2019-12-27 00:00:00     1
2  2 2019-12-27 00:00:00     3
3  2 2019-12-27 00:55:00     2
4  2 2019-12-27 01:00:00     1
5  2 2019-12-28 01:00:00     1
6  3 2019-12-27 22:00:00     2
7  3 2019-12-27 22:31:00     1
8  3 2019-12-28 14:32:00     1
I have the following data:
df <- data.frame(dt1 = c("2019-05-02", "2019-01-02", "2019-06-02"),
dt2 = c("2019-08-30", "2019-04-24", "2019-12-06") )
> df
dt1 dt2
1 2019-05-02 2019-08-30
2 2019-01-02 2019-04-24
3 2019-06-02 2019-12-06
Here is what I want to do:
i) I want to create factors by binning on month-end dates; for example, for the first row the month-ends 2019-05-31, 2019-06-30 and 2019-07-31 fall between dt1 and dt2, so essentially I am binning by the month-ends up to dt2.
ii) I want to count the total number of dates in each bin.
The expected output is:
dt1 dt2 val_count
1 2019-05-02 2019-08-30 3
2 2019-01-02 2019-04-24 3
3 2019-06-02 2019-12-06 6
I found this post relevant.
Note: I do not want to take the difference in months between the two dates.
Thank you for any suggestions.
It's pretty messy, but if you want to count how many month-end dates fall between dt1 and dt2, you may try:
library(lubridate)
library(dplyr)
# all month-end dates from Jan 31 of the earliest year through Dec 31 of the latest year
fd <- paste0(lubridate::year(min(df$dt1, df$dt2)), "-02-01") %>% as.Date()
ld <- paste0(lubridate::year(max(df$dt1, df$dt2)) + 1, "-01-01") %>% as.Date()
x <- seq.Date(fd, ld, by = "month") - 1
df %>%
rowwise() %>%
mutate(val_count = length(x[dt1 < x & x < dt2]))
dt1 dt2 val_count
<chr> <chr> <int>
1 2019-05-02 2019-08-30 3
2 2019-01-02 2019-04-24 3
3 2019-06-02 2019-12-06 6
The choice of < or <= depends on your purpose; with <=, a dt2 that falls exactly on a month-end would itself be counted.
To get total days between dt1 and dt2,
df %>%
rowwise() %>%
mutate(val_count = length(x[dt1 < x & x < dt2])) %>%
mutate(dd = as.Date(dt2) - as.Date(dt1))
dt1 dt2 val_count dd
<chr> <chr> <int> <drtn>
1 2019-05-02 2019-08-30 3 120 days
2 2019-01-02 2019-04-24 3 112 days
3 2019-06-02 2019-12-06 6 187 days
Addendum (mapping a count of 0 to 1):
df %>%
rowwise() %>%
mutate(val_count = length(x[dt1 < x & x < dt2]),
val_count = ifelse(val_count == 0, 1, val_count)) %>%
mutate(dd = as.Date(dt2) - as.Date(dt1))
dt1 dt2 val_count dd
<chr> <chr> <dbl> <drtn>
1 2019-05-02 2019-08-30 3 120 days
2 2019-01-02 2019-04-24 3 112 days
3 2019-06-02 2019-12-06 6 187 days
4 2019-06-01 2019-06-02 1 1 days
The above solution is indeed kinda messy; it takes just a simple one-liner to do this:
df <- data.frame(dt1 = c("2019-05-02", "2019-01-02", "2019-06-02", "2019-06-01"), dt2 = c("2019-08-30", "2019-04-24", "2019-12-06", "2019-06-02") )
library(dplyr)
library(lubridate)

df %>%
  mutate(val_count = as.period(ymd(dt2) - ymd(dt1)) %/% months(1))
# dt1 dt2 val_count
# 1 2019-05-02 2019-08-30 3
# 2 2019-01-02 2019-04-24 3
# 3 2019-06-02 2019-12-06 6
# 4 2019-06-01 2019-06-02 0
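A closely related idiom sidesteps the period-division approximation (lubridate estimates an average month length when dividing a day-based period) by integer-dividing an interval instead; on this data it gives the same counts:
# whole calendar months between dt1 and dt2
df %>%
  mutate(val_count = interval(ymd(dt1), ymd(dt2)) %/% months(1))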
I have a set of patient IDs and a date column. I want to fill the Date1 column with the next Date for that ID, minus one day. For example:
ID Date Date1
1 23-10-2017 23-09-2018
1 24-09-2018 28-08-2019
1 29-08-2019 -
2 30-05-2016 11-06-2017
2 12-06-2017 12-07-2018
2 13-07-2018 -
I don't know if I get what you want, but if you just want the date less one day, this is the code:
x <- data.frame(ID = c(1, 1, 1, 2, 2, 2),
                Date = as.Date(c("23-10-2017", "24-09-2018", "29-08-2019",
                                 "30-05-2016", "12-06-2017", "13-07-2018"), "%d-%m-%Y"))
x$Date1 <- x$Date - 1
Shift by one row by group, then subtract one day:
library(data.table)
dt1 <- fread("
ID Date
1 23-10-2017
1 24-09-2018
1 29-08-2019
2 30-05-2016
2 12-06-2017
2 13-07-2018")
# convert to Date (4-digit year, so %Y rather than %y)
dt1[, Date := as.Date(Date, "%d-%m-%Y")]
# shift (lead) per group, then minus 1 day
dt1[, Date1 := shift(Date, -1) - 1, by = ID]
dt1
# ID Date Date1
# 1: 1 2017-10-23 2018-09-23
# 2: 1 2018-09-24 2019-08-28
# 3: 1 2019-08-29 <NA>
# 4: 2 2016-05-30 2017-06-11
# 5: 2 2017-06-12 2018-07-12
# 6: 2 2018-07-13 <NA>
Try using lead:
library(dplyr)
df %>%
group_by(ID) %>%
mutate(Date1 = lead(Date)-1)
# A tibble: 6 x 3
# Groups: ID [2]
ID Date Date1
<int> <date> <date>
1 1 2017-10-23 2018-09-23
2 1 2018-09-24 2019-08-28
3 1 2019-08-29 NA
4 2 2016-05-30 2017-06-11
5 2 2017-06-12 2018-07-12
6 2 2018-07-13 NA
My dataframe looks something like this:
df <- read.table(text="
id start end
1 2 2018-10-01 2018-12-01
2 3 2018-01-01 2018-04-01
", header=TRUE)
What I'm trying to achieve is to get the difference between start and end in months for each id, and then generate a new dataframe with a row for each month for that id. The result should be:
result <- read.table(text="
id date
1 2 2018-10-01
2 2 2018-11-01
3 2 2018-12-01
4 3 2018-01-01
5 3 2018-02-01
6 3 2018-03-01
7 3 2018-04-01
", header=TRUE)
The most straightforward way using base R functions is to create a sequence of monthly dates for each row, build a dataframe from each, and rbind them together:
do.call(rbind, with(df,lapply(1:nrow(df), function(i)
data.frame(id = id[i], date = seq(as.Date(start[i]), as.Date(end[i]), by = "month")))))
# id date
#1 2 2018-10-01
#2 2 2018-11-01
#3 2 2018-12-01
#4 3 2018-01-01
#5 3 2018-02-01
#6 3 2018-03-01
#7 3 2018-04-01
We can do this easily with Map. Pass the Date-converted 'start' and 'end' columns of the dataset as arguments to Map to get the sequences of months as a list, then expand 'id' based on the lengths of the list and concatenate the list elements to create the expanded dataframe:
lst1 <- Map(seq, MoreArgs = list(by = 'month'), as.Date(df$start), as.Date(df$end))
data.frame(id = rep(df$id, lengths(lst1)), date = do.call(c, lst1))
# id date
#1 2 2018-10-01
#2 2 2018-11-01
#3 2 2018-12-01
#4 3 2018-01-01
#5 3 2018-02-01
#6 3 2018-03-01
#7 3 2018-04-01
Or using the tidyverse: we convert the 'start' and 'end' columns to Date class, use map2 (from purrr) to get the sequence of dates from 'start' to 'end' by month, and expand the data by unnesting:
library(tidyverse)
df %>%
mutate_at(2:3, as.Date) %>%
transmute(id = id, date = map2(start, end, ~ seq(.x, .y, by = 'month'))) %>%
unnest
# id date
#1 2 2018-10-01
#2 2 2018-11-01
#3 2 2018-12-01
#4 3 2018-01-01
#5 3 2018-02-01
#6 3 2018-03-01
#7 3 2018-04-01
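(With tidyr 1.0 and later, unnest() wants the column spelled out, i.e. unnest(date).)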
I have a data frame with a specific variable (Var1) and a time variable (Var2).
I would like to calculate the frequency of occurrence (Frequency) of Var1 within a specific time step (say 1 min) over a year.
sample dataset:
Var1 <- c(rep("A", 4), rep("B", 3), rep("C", 2))
Var2 <- c("2018-09-01 10:00:00", "2018-09-01 10:00:30", "2018-09-01 10:00:45",
"2018-09-10 22:10:00", "2017-09-05 10:54:30", "2018-12-15 10:00:30",
"2018-12-15 10:01:00", "2017-02-20 17:16:30", "2017-12-20 20:08:56")
df <- data.frame(Var1, Var2)
df$Var2 <- as.POSIXct(df$Var2)
desired output:
Frequency <- c(rep(3, 3), rep(1, 2), rep(2,2), rep(1,2))
dfOut <- data.frame(Var1, Var2, Frequency)
# Var1 Var2 Frequency
#1 A 2018-09-01 10:00:00 3
#2 A 2018-09-01 10:00:30 3
#3 A 2018-09-01 10:00:45 3
#4 A 2018-09-10 22:10:00 1
#5 B 2017-09-05 10:54:30 1
#6 B 2018-12-15 10:00:30 2
#7 B 2018-12-15 10:01:00 2
#8 C 2017-02-20 17:16:30 1
#9 C 2017-12-20 20:08:56 1
You can use lubridate::floor_date to get a minute grouping column that accounts for the date, as you are describing. Note that your displayed desired output does not seem to match your comment:
Var1 <- c(rep("A", 4), rep("B", 3), rep("C", 2))
Var2 <- c("2018-09-01 10:00:00", "2018-09-01 10:00:30", "2018-09-01 10:00:45",
"2018-09-10 22:10:00", "2017-09-05 10:54:30", "2018-12-15 10:00:30",
"2018-12-15 10:01:00", "2017-02-20 17:16:30", "2017-12-20 20:08:56")
df <- data.frame(Var1, Var2)
df$Var2 <- as.POSIXct(df$Var2)
library(tidyverse)
library(lubridate)
df %>%
mutate(minute = floor_date(Var2, unit = "minute")) %>%
add_count(Var1, minute)
#> # A tibble: 9 x 4
#> Var1 Var2 minute n
#> <fct> <dttm> <dttm> <int>
#> 1 A 2018-09-01 10:00:00 2018-09-01 10:00:00 3
#> 2 A 2018-09-01 10:00:30 2018-09-01 10:00:00 3
#> 3 A 2018-09-01 10:00:45 2018-09-01 10:00:00 3
#> 4 A 2018-09-10 22:10:00 2018-09-10 22:10:00 1
#> 5 B 2017-09-05 10:54:30 2017-09-05 10:54:00 1
#> 6 B 2018-12-15 10:00:30 2018-12-15 10:00:00 1
#> 7 B 2018-12-15 10:01:00 2018-12-15 10:01:00 1
#> 8 C 2017-02-20 17:16:30 2017-02-20 17:16:00 1
#> 9 C 2017-12-20 20:08:56 2017-12-20 20:08:00 1
Created on 2018-09-11 by the reprex package (v0.2.0).
You can do something like this. Create a new character vector to define the groups, then group by Var1 and the new variable. This doesn't give exactly your desired output because the minutes are defined differently.
library(dplyr)
df %>%
mutate(minute = substring(as.character(Var2), 1, 16)) %>%
group_by(Var1, minute) %>%
mutate(frequency = n())
Here is a data.table approach. First create an index that increments whenever a row's datetime is more than 1 minute after the previous row's; then use this index as one of the grouping criteria to calculate the frequency.
library(data.table)
setDT(df)[, idx := cumsum(c(0L, Var2[-1L] > Var2[-.N] + 60L)), by=.(Var1)][,
Freq := .N, by=.(Var1, idx)]
output:
Var1 Var2 idx Freq
1: A 2018-09-01 10:00:00 0 3
2: A 2018-09-01 10:00:30 0 3
3: A 2018-09-01 10:00:45 0 3
4: A 2018-09-10 22:10:00 1 1
5: B 2017-09-05 10:54:30 0 1
6: B 2018-12-15 10:00:30 1 2
7: B 2018-12-15 10:01:00 1 2
8: C 2017-02-20 17:16:30 0 1
9: C 2017-12-20 20:08:56 1 1
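Note that this sessionizes runs of observations that are each within a minute of the previous one, which is why rows 6 and 7 get Freq 2 here while the floor_date answer gives them n = 1. Pick whichever definition of "within 1 min" matches your intent.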
I have the following situation.
df <- rbind(
data.frame(thisDate = rep(seq(as.Date("2018-1-1"), as.Date("2018-1-2"), by="day")) ),
data.frame(thisDate = rep(seq(as.Date("2018-2-1"), as.Date("2018-2-2"), by="day")) ))
df <- cbind(df,lastMonth = as.Date(format(as.Date(df$thisDate - months(1)),"%Y-%m-01")))
df <- cbind(df, prod1Quantity= seq(1:4) )
I have quantities for different days of a month for an unknown number of products. I want one column for every product holding that product's total quantity over the whole previous month. So the output would be like this, i.e. grouped by lastMonth and prod1Quantity. I just don't get how to group_by, mutate and summarise dynamically, if that is indeed the right approach.
I came across "data.table generate multiple columns and summarize them". It appears to do what I need, but I just don't get how it works!
Desired Output
thisDate lastMonth prod1Quantity prod1prevMonth
1 2018-01-01 2017-12-01 1 NA
2 2018-01-02 2017-12-01 2 NA
3 2018-02-01 2018-01-01 3 3
4 2018-02-02 2018-01-01 4 3
Another approach could be
library(dplyr)
library(lubridate)
temp_df <- df %>%
mutate(thisDate_forJoin = as.Date(format(thisDate,"%Y-%m-01")))
final_df <- temp_df %>%
mutate(thisDate_forJoin = thisDate_forJoin %m-% months(1)) %>%
left_join(temp_df %>%
group_by(thisDate_forJoin) %>%
summarise_if(is.numeric, sum),
by="thisDate_forJoin") %>%
select(-thisDate_forJoin)
Output is:
thisDate prod1Quantity.x prod2Quantity.x prod1Quantity.y prod2Quantity.y
1 2018-01-01 1 10 NA NA
2 2018-01-02 2 11 NA NA
3 2018-02-01 3 12 3 21
4 2018-02-02 4 13 3 21
Sample data:
df <- structure(list(thisDate = structure(c(17532, 17533, 17563, 17564
), class = "Date"), prod1Quantity = 1:4, prod2Quantity = 10:13), class = "data.frame", row.names = c(NA,
-4L))
# thisDate prod1Quantity prod2Quantity
#1 2018-01-01 1 10
#2 2018-01-02 2 11
#3 2018-02-01 3 12
#4 2018-02-02 4 13
A solution can be reached by calculating the month-wise production quantity and then joining on the month of lastMonth and thisDate. The lubridate::month function is used to extract the month from a date.
library(dplyr)
library(lubridate)
df %>% group_by(month = as.integer(month(thisDate))) %>%
summarise(prodQuantMonth = sum(prod1Quantity)) %>%
right_join(., mutate(df, prevMonth = month(lastMonth)), by=c("month" = "prevMonth")) %>%
select(thisDate, lastMonth, prod1Quantity, prodQuantLastMonth = prodQuantMonth)
# # A tibble: 4 x 4
# thisDate lastMonth prod1Quantity prodQuantLastMonth
# <date> <date> <int> <int>
# 1 2018-01-01 2017-12-01 1 NA
# 2 2018-01-02 2017-12-01 2 NA
# 3 2018-02-01 2018-01-01 3 3
# 4 2018-02-02 2018-01-01 4 3
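One caveat: keying on month() alone will collide if the data spans more than one calendar year; grouping on a year-month value such as format(thisDate, "%Y-%m") on both sides of the join avoids that.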