I have a data.table with the following shape:
date_from date_until value
2015-01-01 2015-01-03 100
2015-01-02 2015-01-05 50
2015-01-02 2015-01-04 10
...
What I want to do is compute, for every date, the sum of all values whose date range covers that date. For the first row, the value 100 would count for every day from 2015-01-01 until 2015-01-03.
So, in the end there would be a data.table like this:
date value
2015-01-01 100
2015-01-02 160
2015-01-03 160
2015-01-04 60
2015-01-05 50
Is there an easy way to do this with data.table?
dt[, .(date = seq(as.Date(date_from, '%Y-%m-%d'),
as.Date(date_until, '%Y-%m-%d'),
by='1 day'),
value), by = 1:nrow(dt)][, sum(value), by = date]
# date V1
#1: 2015-01-01 100
#2: 2015-01-02 160
#3: 2015-01-03 160
#4: 2015-01-04 60
#5: 2015-01-05 50
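For reference, a minimal reproducible setup for the snippet above (storing the dates as character is an assumption based on the printed sample):
library(data.table)
# sample data from the question, with dates stored as character
dt <- data.table(date_from  = c('2015-01-01', '2015-01-02', '2015-01-02'),
                 date_until = c('2015-01-03', '2015-01-05', '2015-01-04'),
                 value      = c(100, 50, 10))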
And another option using foverlaps:
# convert to Date for ease
dt[, date_from := as.Date(date_from, '%Y-%m-%d')]
dt[, date_until := as.Date(date_until, '%Y-%m-%d')]
# all of the dates
alldates = dt[, do.call(seq, c(as.list(range(c(date_from, date_until))), by = '1 day'))]
# foverlaps to find the intersections
foverlaps(dt, data.table(date_from = alldates, date_until = alldates,
key = c('date_from', 'date_until')))[,
sum(value), by = date_from]
# date_from V1
#1: 2015-01-01 100
#2: 2015-01-02 160
#3: 2015-01-03 160
#4: 2015-01-04 60
#5: 2015-01-05 50
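On recent data.table versions (1.9.8+), a non-equi join can express the same intersection without building a dummy interval table; a sketch, assuming dt (with its columns already converted to Date) and alldates from above:
dt[.(date = alldates),
   on = .(date_from <= date, date_until >= date),
   .(value = sum(value)), by = .EACHI]
# the two grouping columns (still named date_from and date_until)
# both display the queried date after the join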
I am using a dataset which is grouped with the group_by function of the dplyr package.
Each group has its own time index, which is supposed to consist of a 12-month sequence.
This means it can start in January and end in December, or in other cases start in June of one year and end in May of the next.
Here is the dataset example:
ID DATE
8 2017-01-31
8 2017-02-28
8 2017-03-31
8 2017-04-30
8 2017-05-31
8 2017-06-30
8 2017-07-31
8 2017-08-31
8 2017-09-30
8 2017-10-31
8 2017-11-30
8 2017-12-31
32 2017-01-31
32 2017-02-28
32 2017-03-31
32 2017-04-30
32 2017-05-31
32 2017-06-30
32 2017-07-31
32 2017-08-31
32 2017-09-30
32 2017-10-31
32 2017-11-30
32 2017-12-31
45 2016-09-30
45 2016-10-31
45 2016-11-30
45 2016-12-31
45 2017-01-31
45 2017-02-28
45 2017-03-31
45 2017-04-30
45 2017-05-31
45 2017-06-30
45 2017-07-31
45 2017-08-31
The problem is that, because of the dataset's dimensions, I can't visually confirm whether there are so-called "jumps", in other words whether the dates are consecutive. Is there any simple way in R to do that, perhaps some modification/combination of functions from the tibbletime package?
Any help will be appreciated.
Thank you in advance.
Here's how I would typically approach this problem using data.table -- the cut.Date() and seq.Date() functions from base R are the meat of the logic, so you could use the same approach with dplyr if desired.
library(data.table)
## Convert to data.table
setDT(df)
## Convert DATE to a date in case it wasn't already
df[,DATE := as.Date(DATE)]
## Order by ID and Date
setkey(df,ID,DATE)
## Create a column with the month of each date
df[,Month := as.Date(cut.Date(DATE, breaks = "months"))]
## Generate a sequence of Dates by month for the number of observations
## in each group -- .N
df[,ExpectedMonth := seq.Date(from = min(Month),
by = "months",
length.out = .N), by = .(ID)]
## Create a summary table to test whether an ID had 12 observations where
## the actual month was equal to the expected month
Test <- df[Month == ExpectedMonth, .(Valid = ifelse(.N == 12L,TRUE,FALSE)), by = .(ID)]
print(Test)
# ID Valid
# 1: 8 TRUE
# 2: 32 TRUE
# 3: 45 TRUE
## Do a no-copy join of Test to df based on ID
## and create a column in df based on the 'Valid' column in Test
df[Test, Valid := i.Valid, on = "ID"]
## The final output:
head(df)
# ID DATE Month ExpectedMonth Valid
# 1: 8 2017-01-31 2017-01-01 2017-01-01 TRUE
# 2: 8 2017-02-28 2017-02-01 2017-02-01 TRUE
# 3: 8 2017-03-31 2017-03-01 2017-03-01 TRUE
# 4: 8 2017-04-30 2017-04-01 2017-04-01 TRUE
# 5: 8 2017-05-31 2017-05-01 2017-05-01 TRUE
# 6: 8 2017-06-30 2017-06-01 2017-06-01 TRUE
You could also do things a little more compactly, if you really wanted to, by using a self-join and skipping the creation of Test:
setDT(df)
df[,DATE := as.Date(DATE)]
setkey(df,ID,DATE)
df[,Month := as.Date(cut.Date(DATE, breaks = "months"))]
df[,ExpectedMonth := seq.Date(from = min(Month), by = "months", length.out = .N), keyby = .(ID)]
df[df[Month == ExpectedMonth,.(Valid = ifelse(.N == 12L,TRUE,FALSE)),keyby = .(ID)], Valid := i.Valid]
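Since the question is framed around dplyr, a rough equivalent built on the same cut.Date()/seq.Date() logic might look like this (a sketch only, not tested against edge cases such as duplicated months within an ID):
library(dplyr)
df %>%
  mutate(DATE = as.Date(DATE)) %>%
  arrange(ID, DATE) %>%
  group_by(ID) %>%
  mutate(Month = as.Date(cut.Date(DATE, breaks = "months")),
         # expected months: a monthly sequence from each group's first month
         ExpectedMonth = seq.Date(from = min(Month),
                                  by = "months",
                                  length.out = n()),
         Valid = sum(Month == ExpectedMonth) == 12L) %>%
  ungroup()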
You can use the summarise function from dplyr to return a logical value indicating whether there are any day differences greater than 31 within each ID. You do this by first constructing a temporary date using only the year and month, attaching "-01" as a fake day; with every date pinned to the first of its month, consecutive months are at most 31 days apart, while a skipped month produces a gap of at least 59 days:
library(dplyr)
library(lubridate)
df %>%
group_by(ID) %>%
mutate(DATE2 = ymd(paste0(sub('\\-\\d+$', '', DATE),'-01')),
DATE_diff = c(0, diff(DATE2))) %>%
summarise(Valid = !any(DATE_diff > 31))
Result:
# A tibble: 3 x 2
ID Valid
<int> <lgl>
1 8 TRUE
2 32 TRUE
3 45 TRUE
You can also visually check if there are any gaps by plotting your dates for each ID:
library(ggplot2)
df %>%
mutate(DATE = ymd(paste0(sub('\\-\\d+$', '', DATE),'-01')),
ID = as.factor(ID)) %>%
ggplot(aes(x = DATE, y = ID, group = ID)) +
geom_point(aes(color = ID)) +
scale_x_date(date_breaks = "1 month",
date_labels = "%b-%Y") +
labs(title = "Time Line by ID")
Suppose I have two data.tables:
library(data.table)
library(lubridate)  # for ymd()

summary <- data.table(period = c("A","B","C","D"),
from_date = ymd(c("2017-01-01", "2017-01-03", "2017-02-08", "2017-03-07")),
to_date = ymd(c("2017-01-31", "2017-04-01", "2017-03-08", "2017-05-01"))
)
log <- data.table(date = ymd(c("2017-01-03","2017-01-20","2017-02-01","2017-03-03",
"2017-03-15","2017-03-28","2017-04-03","2017-04-23")),
event1 = c(4,8,8,4,3,4,7,3), event2 = c(1,8,7,3,8,4,6,3))
which look like this:
> summary
period from_date to_date
1: A 2017-01-01 2017-01-31
2: B 2017-01-03 2017-04-01
3: C 2017-02-08 2017-03-08
4: D 2017-03-07 2017-05-01
> log
date event1 event2
1: 2017-01-03 4 1
2: 2017-01-20 8 8
3: 2017-02-01 8 7
4: 2017-03-03 4 3
5: 2017-03-15 3 8
6: 2017-03-28 4 4
7: 2017-04-03 7 6
8: 2017-04-23 3 3
I would like to get the sum of event1 and event2 for each time period in the table summary.
I know I can do this:
summary[, c("event1","event2") := .(sum(log[date>=from_date & date<=to_date, event1]),
sum(log[date>=from_date & date<=to_date, event2]))
, by=period][]
to get the desired result:
period from_date to_date event1 event2
1: A 2017-01-01 2017-01-31 12 9
2: B 2017-01-03 2017-04-01 31 31
3: C 2017-02-08 2017-03-08 4 3
4: D 2017-03-07 2017-05-01 17 21
Now, in my real-life problem, I have about 30 columns to be summed, which I may want to change later, and summary has ~35,000 rows, log has ~40,000,000 rows. Is there an efficient way to achieve this?
Note: This is my first post here. I hope my question is clear and specific enough, please do make suggestions if there is anything I should do to improve the question. Thanks!
Yes, you can perform a non-equi join.
(Note I've changed log and summary to Log and Summary as the originals are already functions in R.)
Log[Summary,
on = c("date>=from_date", "date<=to_date"),
nomatch=0L,
allow.cartesian = TRUE][, .(from_date = date[1],
to_date = date.1[1],
event1 = sum(event1),
event2 = sum(event2)),
keyby = "period"]
To sum over a pattern of columns, use lapply with .SD:
joined_result <-
Log[Summary,
on = c("date>=from_date", "date<=to_date"),
nomatch = 0L,
allow.cartesian = TRUE]
cols <- grep("event[a-z]?[0-9]", names(joined_result), value = TRUE)
joined_result[, lapply(.SD, sum),
.SDcols = cols,
keyby = .(period,
from_date = date,
to_date = date.1)]
With data.table, it is possible to aggregate during a non-equi join using by = .EACHI.
log[summary, on = .(date >= from_date, date <= to_date), nomatch=0L,
lapply(.SD, sum), by = .EACHI]
date date event1 event2
1: 2017-01-01 2017-01-31 12 9
2: 2017-01-03 2017-04-01 31 31
3: 2017-02-08 2017-03-08 4 3
4: 2017-03-07 2017-05-01 17 21
With some additional clean-up:
log[summary, on = .(date >= from_date, date <= to_date), nomatch=0L,
c(period = period, lapply(.SD, sum)), by = .EACHI][
, setnames(.SD, 1:2, c("from_date", "to_date"))]
from_date to_date period event1 event2
1: 2017-01-01 2017-01-31 A 12 9
2: 2017-01-03 2017-04-01 B 31 31
3: 2017-02-08 2017-03-08 C 4 3
4: 2017-03-07 2017-05-01 D 17 21
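Since the real problem has ~30 event columns that may change later, you can also control which columns get summed with .SDcols rather than relying on .SD picking them up; a sketch, assuming the columns follow the eventN naming pattern:
cols <- grep('^event', names(log), value = TRUE)
log[summary, on = .(date >= from_date, date <= to_date), nomatch = 0L,
    c(list(period = period), lapply(.SD, sum)),
    by = .EACHI, .SDcols = cols]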
I have a dataframe ("observations") with time stamps in H:M format ("Time"). In a second dataframe ("intervals"), I have time ranges defined by "From" and "Till" variables, also in H:M format.
I want to count the number of observations which fall within each interval. I have been using between from data.table, which has worked without any problem when dates are included.
However, now I only have time stamps, without dates. This causes problems for the times which occur in the interval that spans midnight (20:00 - 05:59): these times are not counted by the code I have tried.
Example below
interval.data <- data.frame(From = c("14:00", "20:00", "06:00"), Till = c("19:59", "05:59", "13:59"), stringsAsFactors = F)
observations <- data.frame(Time = c("14:32", "15:59", "16:32", "21:34", "03:32", "02:00", "00:00", "05:57", "19:32", "01:32", "02:22", "06:00", "07:50"), stringsAsFactors = F)
interval.data
#    From  Till
# 1 14:00 19:59
# 2 20:00 05:59  # <- interval including midnight
# 3 06:00 13:59
observations
#     Time
# 1  14:32
# 2  15:59
# 3  16:32
# 4  21:34  # rows 4-8 and 10-11 fall in the 'midnight interval' but are not counted
# 5  03:32  #
# 6  02:00  #
# 7  00:00  #
# 8  05:57  #
# 9  19:32
# 10 01:32  #
# 11 02:22  #
# 12 06:00
# 13 07:50
library(data.table)
library(plyr)
adply(interval.data, 1, function(x, y) sum(y[, 1] %between% c(x[1], x[2])), y = observations)
# From Till V1
# 1 14:00 19:59 4
# 2 20:00 05:59 0 # <- zero counts - wrong!
# 3 06:00 13:59 2
One approach is to use a non-equi join in data.table, and their helper function as.ITime for working with time strings.
You'll have an issue with the interval that spans midnight, but there should only ever be one of those. And as you're interested in the number of observations per 'group' of intervals, you can treat this group as the complement of the others (i.e. 'not in' any of them).
For example, first convert your data.frame to data.table
library(data.table)
## set your data.frames as `data.table`
setDT(interval.data)
setDT(observations)
Then use as.ITime to convert to an integer representation of time
## convert time stamps
interval.data[, `:=`(FromMins = as.ITime(From),
TillMins = as.ITime(Till))]
observations[, TimeMins := as.ITime(Time)]
## you could combine this step with the non-equi join directly, but I'm separating it for clarity
You can now use a non-equi join to find the interval each time falls within, noting that the times that return NA are those that fall inside the midnight-spanning interval:
interval.data[
observations
, on = .(FromMins <= TimeMins, TillMins > TimeMins)
]
# From Till FromMins TillMins Time
# 1: 14:00 19:59 872 872 14:32
# 2: 14:00 19:59 959 959 15:59
# 3: 14:00 19:59 992 992 16:32
# 4: NA NA 1294 1294 21:34
# 5: NA NA 212 212 03:32
# 6: NA NA 120 120 02:00
# 7: NA NA 0 0 00:00
# 8: NA NA 357 357 05:57
# 9: 14:00 19:59 1172 1172 19:32
# 10: NA NA 92 92 01:32
# 11: NA NA 142 142 02:22
# 12: 06:00 13:59 360 360 06:00
# 13: 06:00 13:59 470 470 07:50
Then, to get the number of observations for each group of intervals, you take .N grouped by the interval columns, which can be chained onto the end of the above statement:
interval.data[
observations
, on = .(FromMins <= TimeMins, TillMins > TimeMins)
][
, .N
, by = .(From, Till)
]
# From Till N
# 1: 14:00 19:59 4
# 2: NA NA 7
# 3: 06:00 13:59 2
The NA group corresponds to the interval that spans midnight.
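If you'd rather see that interval labelled instead of NA, you can fill it back in afterwards; a sketch, relying on the assumption above that only one interval spans midnight:
res <- interval.data[observations,
                     on = .(FromMins <= TimeMins, TillMins > TimeMins)][
                       , .N, by = .(From, Till)]
# the midnight-spanning interval is the one whose From is after its Till
midnight <- interval.data[FromMins > TillMins]
res[is.na(From), `:=`(From = midnight$From, Till = midnight$Till)]
res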
I just tweaked your code to get the desired result: the interval that spans midnight is split into two ranges, one up to 23:59 and one from 00:00. (Note the time bounds must be quoted strings; unquoted 23:59 would be the integer sequence 23:59 in R.) Hope this helps!
adply(interval.data, 1, function(x, y) {
  # From > Till identifies the midnight-spanning interval;
  # zero-padded "HH:MM" strings compare correctly as characters
  if (x[1] > x[2])
    sum(y[, 1] %between% c(x[1], "23:59"), y[, 1] %between% c("00:00", x[2]))
  else
    sum(y[, 1] %between% c(x[1], x[2]))
}, y = observations)
Output is:
From Till V1
1 14:00 19:59 4
2 20:00 05:59 7
3 06:00 13:59 2
I have this CSV data:
Date Kilometer
2015-01-01 15:56:00 1
2015-01-01 17:40:00 2
2015-01-02 14:38:00 4
2015-01-02 14:45:00 3
and would like to group by date and sum Kilometer, like this:
Date Kilometer
2015-01-01 3
2015-01-02 7
We can use data.table
library(data.table)
library(lubridate)
setDT(df)[, .(Kilometer = sum(Kilometer)) , .(Date=date(Date))]
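For the sample data this returns the requested daily totals (output sketched from the sums given in the question):
#          Date Kilometer
# 1: 2015-01-01         3
# 2: 2015-01-02         7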
This can be done using dplyr and lubridate
library(dplyr)
df %>% group_by(Date = lubridate::date(Date)) %>% summarise(Kilometer=sum(Kilometer))
Date Kilometer
(date) (int)
1 2015-01-01 3
2 2015-01-02 7
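For completeness, a base R sketch of the same aggregation (this assumes the Date column parses with as.Date, which drops the time-of-day component):
# as.Date("2015-01-01 15:56:00") yields "2015-01-01"
df$Date <- as.Date(df$Date)
aggregate(Kilometer ~ Date, data = df, FUN = sum)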
Looking to add a bi-weekly date column to a data.table. I have a working solution, but it seems messy. Also, I have the feeling rolling joins should do the trick, but I'm not sure how. Are there any better solutions for creating a grouping of bi-weekly dates?
library(data.table)

# Mock data table (runif() is unseeded here, so your values will differ from the output below)
dt <- data.table(value = runif(20), date = seq(as.Date("2015-01-01"), as.Date("2015-01-20"), by = "days"))
# Bi-weekly dates starting with most recent date and working backwards
bidates <- data.table(bi = seq(dt[, max(date)], dt[, min(date)], by = -14))
# Expand out bi-weekly dates to match up with every date in that range
bidates <- bidates[, seq(bi - 13, bi, by = "days"), by = bi]
# Key and merge
setkey(dt, date)
setkey(bidates, V1)
dt[bidates, bi := i.bi]
Here's how you can use rolling joins:
bis = dt[, .(date = seq(max(date), min(date), by = -14))][, bi := date]
setkey(bis, date)
setkey(dt, date)
bis[dt, roll = -Inf]
# date bi value
# 1: 2015-01-01 2015-01-06 0.2433854
# 2: 2015-01-02 2015-01-06 0.5454916
# 3: 2015-01-03 2015-01-06 0.3334531
# 4: 2015-01-04 2015-01-06 0.9134877
# 5: 2015-01-05 2015-01-06 0.4557901
# 6: 2015-01-06 2015-01-06 0.3459536
# 7: 2015-01-07 2015-01-20 0.8024527
# 8: 2015-01-08 2015-01-20 0.1833166
# 9: 2015-01-09 2015-01-20 0.1024560
#10: 2015-01-10 2015-01-20 0.4052751
#11: 2015-01-11 2015-01-20 0.9564279
#12: 2015-01-12 2015-01-20 0.6413953
#13: 2015-01-13 2015-01-20 0.7614291
#14: 2015-01-14 2015-01-20 0.2176500
#15: 2015-01-15 2015-01-20 0.3352939
#16: 2015-01-16 2015-01-20 0.4847095
#17: 2015-01-17 2015-01-20 0.8450636
#18: 2015-01-18 2015-01-20 0.8513685
#19: 2015-01-19 2015-01-20 0.2012410
#20: 2015-01-20 2015-01-20 0.3847956
Starting from version 1.9.5+ you don't need to set the keys and can do:
bis[dt, roll = -Inf, on = 'date']
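Note that roll = -Inf rolls backwards ("next observation carried backward"): each date in dt without an exact match in bis is matched to the next available bi date, which is what maps every day onto the end of its bi-weekly window. A minimal check of that behaviour, assuming the tables built above:
# 2015-01-03 has no exact match in bis, so it rolls to the
# next available date, 2015-01-06
bis[.(as.Date('2015-01-03')), roll = -Inf, on = 'date']
#          date         bi
# 1: 2015-01-03 2015-01-06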