Vectorized functions in R's data.table

Problem: I am trying to add a column to the data.table object below in which, for each row, a list of weeks is displayed. I.e. if START = "2020-01-01" and END = "2020-01-15", the week column should contain the respective weeks for this interval (2020 W01, 2020 W02, 2020 W03). I want to keep the function that prepares the data separate because of the code structure. However, the current function results in an error.
Question: Is there a way to keep it that simple, i.e. without referring to the data.table object in the get_weeks call? What could a modified function look like? Cheers!
library(data.table)
library(lubridate)

dt <- data.table(
  ID = c(1, 2, 3),
  START = c("2020-01-01", "2020-03-01", "2020-03-14"),
  END = c("2020-01-15", "2020-03-12", "2020-03-26")
)

get_weeks <- function(start_date, end_date){
  date_range <- c(start_date, end_date)
  date_range <- ymd(date_range)
  dt_range <- seq.Date(date_range[1], date_range[2], "day")
  dt_range_week <- list(unique(format(as.Date(dt_range), "%G W%V")))
  dt_range_week
}
# this errors: the literal strings "START" and "END" are passed to get_weeks, not the
# column values, and the function is not vectorized over rows
dt[, weeks_for_filter_table := get_weeks("START", "END")]

You could use Map/mapply:
library(data.table)
dt[, weeks_for_filter_table := mapply(get_weeks, START, END)]
dt
#    ID      START        END     weeks_for_filter_table
# 1:  1 2020-01-01 2020-01-15 2020 W01,2020 W02,2020 W03
# 2:  2 2020-03-01 2020-03-12 2020 W09,2020 W10,2020 W11
# 3:  3 2020-03-14 2020-03-26 2020 W11,2020 W12,2020 W13
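If you would rather avoid mapply's simplification rules altogether, one option is to let the helper return a plain character vector and let Map build the list column (a sketch; get_weeks2 is a hypothetical variant, not from the original post):

library(lubridate)

# hypothetical helper: same logic as get_weeks(), but returns a bare character vector
get_weeks2 <- function(start_date, end_date) {
  dt_range <- seq.Date(ymd(start_date), ymd(end_date), "day")
  unique(format(dt_range, "%G W%V"))
}

dt[, weeks_for_filter_table := Map(get_weeks2, START, END)]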

Related

choose semi-last observations based on date in data.table in R

I have a data.table with dates in it (as factor variables). I am getting the lag values from these. How can I tell R to get the lag values only for the observations dated semi-last? In this case this would be start == "01.01.2015".
example data:
library(data.table)

ID <- rep("A5", 15)
product <- rep(c("prod1", "prod2", "prod3", "prod55", "prod4", "prod9", "prod83"), 3)
start <- c(rep("01.01.2016", 3), rep("01.01.2015", 3), rep("01.01.2014", 3),
           rep("01.01.2013", 3), rep("01.01.2012", 3))
prodID <- c(3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 2, 1, 3, 1, 2)
mydata <- cbind(ID, product[1:15], start, prodID)
mydata <- as.data.table(mydata)
# nameCols (names for the shifted columns) is not defined in this snippet
mydata[, (nameCols) := shift(.SD, 3, fill = "NA", "lead"), .SDcols = c("start", "V2"), by = "prodID"]
For now I have used this to get my result:
mydata[start == "01.01.2015"]
The problem is that the semi-last date is not always the same date. I will be repeating this procedure many times and I want to avoid having to specify it by hand. Any ideas?
Convert the column to a Date object, then sort the unique dates to pick the semi-last one.
library(data.table)
mydata[, start := as.IDate(start, '%d.%m.%Y')]
mydata[start == sort(unique(start), decreasing = TRUE)[2]]
#    ID     V2      start prodID
# 1: A5 prod55 2015-01-01      3
# 2: A5  prod4 2015-01-01      1
# 3: A5  prod9 2015-01-01      2
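If you would rather not sort all the unique dates, an equivalent selection (a sketch, assuming start has already been converted to IDate as above) takes the largest date that is smaller than the maximum:

mydata[start == max(start[start < max(start)])]

This returns the same rows and keeps everything inside one data.table call.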

Separate operations on groups of time series values identified by same flag in R

Does anyone have a solution to perform separate operations on groups of consecutive values that are a subset of a time series and are identified by a recurring, identical flag, in R?
In the example data set created by the code below, this would mean, for example, calculating the mean of “value” separately for each group where “flag” == 1 on consecutive days.
A typical case in science would be a data set recorded by an instrument that repeatedly executes a calibration procedure and flags the corresponding data with the same flag, but the user needs to evaluate each calibration separately with the same procedure.
Thanks for your suggestions. Jens
library(lubridate)
df <- data.frame(
  date = seq(ymd("2018-01-01"), ymd("2018-06-29"), by = "days"),
  flag = rep(c(rep(1, 10), rep(0, 20)), 6),
  value = seq(1, 180, 1)
)
The data.table function rleid is great for giving group IDs to runs of consecutive values. I continue to use data.table here, but you could do everything except the rleid part just as well in dplyr or base (a dplyr sketch follows the output below).
My answer comes down to using data.table::rleid and then picking your favorite way to take the mean by group (R-FAQ link).
library(data.table)
setDT(df)
df[, r_id := rleid(flag)]
df[flag == 1, list(
  min_date = min(date),
  max_date = max(date),
  mean_value = mean(value)
), by = r_id]
# r_id min_date max_date mean_value
# 1: 1 2018-01-01 2018-01-10 5.5
# 2: 3 2018-01-31 2018-02-09 35.5
# 3: 5 2018-03-02 2018-03-11 65.5
# 4: 7 2018-04-01 2018-04-10 95.5
# 5: 9 2018-05-01 2018-05-10 125.5
# 6: 11 2018-05-31 2018-06-09 155.5
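For completeness, here is a rough dplyr sketch of the same idea (not part of the original answer); rleid() is still borrowed from data.table:

library(dplyr)
df %>%
  mutate(r_id = data.table::rleid(flag)) %>%   # id for each run of identical flag values
  filter(flag == 1) %>%                        # keep only the flagged runs
  group_by(r_id) %>%
  summarise(min_date = min(date),
            max_date = max(date),
            mean_value = mean(value))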

R: how to avoid a loop. Counting weekends between the two dates in each row of a dataframe

I have two columns of dates. Two example dates are:
Date1 = "2015-07-17"
Date2 = "2015-07-25"
I am trying to count the number of Saturdays and Sundays between the two dates, each of which is in its own column (columns 5 and 7 in this example code). I need to repeat this process for each row of my dataframe. The end result will be one column that gives the number of Saturdays and Sundays within the date range defined by the two date columns.
I can get the code to work for one row:
sum(weekdays(seq(Date1[1, 5], Date2[1, 7], "days")) %in% c("Saturday", "Sunday") * 1)
The answer for this example is 3. But if I take out the "1" in the row position of Date1 and Date2, I get this error:
Error in seq.Date(Date1[, 5], Date2[, 7], "days") :
'from' must be of length 1
How do I go row by row and get one vector that lists the number of Saturdays and Sundays between the two dates in columns 5 and 7, without using a loop? Another issue is that I have 2 million rows, so I am looking for something a little faster than a loop.
Thank you!!
The map2* functions from the purrr package are a good way to go. They take two vector inputs (e.g. two date columns) and apply a function to each pair of elements. They're pretty fast too!
Here's an example. Note that the _int suffix requests an integer vector back.
library(purrr)
# Example data
d <- data.frame(
  Date1 = as.Date(c("2015-07-17", "2015-07-28", "2015-08-15")),
  Date2 = as.Date(c("2015-07-25", "2015-08-14", "2015-08-20"))
)

# Wrapper function to compute number of weekend days between dates
n_weekend_days <- function(date_1, date_2) {
  sum(weekdays(seq(date_1, date_2, "days")) %in% c("Saturday", "Sunday"))
}
# Iterate row wise
map2_int(d$Date1, d$Date2, n_weekend_days)
#> [1] 3 4 2
If you want to add the results back to your original data frame, mutate() from the dplyr package can help:
library(dplyr)
d <- mutate(d, end_days = map2_int(Date1, Date2, n_weekend_days))
d
#> Date1 Date2 end_days
#> 1 2015-07-17 2015-07-25 3
#> 2 2015-07-28 2015-08-14 4
#> 3 2015-08-15 2015-08-20 2
Here is a solution that uses dplyr to clean things up. It's not too difficult to use with() to assign the columns in the data frame directly instead of the pipe.
Essentially, pick a reference date and count the weekend days up to each of the two dates, using the number of full weeks (via floor) plus a correction from the remainder; then take the difference between the two counts. The code does not handle cases in which the start date or end date falls on a Saturday or Sunday.
# weekdays(as.Date(1, origin = "1970-01-01")) -> "Friday"
require(dplyr)
startDate = as.Date(1, origin = "1970-01-01") # 1970-01-02, which is a Friday
df <- data.frame(start = "2015-07-17", end = "2015-07-25")
df$start <- as.Date(df$start, format = "%Y-%m-%d")
df$end <- as.Date(df$end, format = "%Y-%m-%d")
# you can use with() to define the columns directly instead of %>%
df <- df %>%
  mutate(originDate = startDate) %>%
  mutate(startDayDiff = as.numeric(start - originDate),
         endDayDiff = as.numeric(end - originDate)) %>%
  mutate(startWeekDiff = floor(startDayDiff / 7),
         endWeekDiff = floor(endDayDiff / 7)) %>%
  mutate(NumSatsStart = startWeekDiff + ifelse(startDayDiff %% 7 >= 1, 1, 0),
         NumSunsStart = startWeekDiff + ifelse(startDayDiff %% 7 >= 2, 1, 0),
         NumSatsEnd = endWeekDiff + ifelse(endDayDiff %% 7 >= 1, 1, 0),
         NumSunsEnd = endWeekDiff + ifelse(endDayDiff %% 7 >= 2, 1, 0)
  ) %>%
  mutate(NumSats = NumSatsEnd - NumSatsStart, NumSuns = NumSunsEnd - NumSunsStart)
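A quick check on the single example row (counts worked out by hand; with the Friday reference date above, the code counts weekend days after the start date up to and including the end date):

df$NumSats
# [1] 2   (2015-07-18 and 2015-07-25)
df$NumSuns
# [1] 1   (2015-07-19)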
Dates are stored as the number of days since 1970-01-01, a Thursday.
So the following gives the number of Saturdays and Sundays since that date, up to and including a given date:
f <- function(d) {d <- as.numeric(d); r <- d %% 7; 2*(d %/% 7) + (r>=2) + (r>=3)}
For the number of Saturdays or Sundays between two dates, just subtract, after decrementing the start date to have an inclusive count.
g <- function(d1, d2) f(d2) - f(d1-1)
These are all vectorized functions, so you can call them directly on the columns.
# Example data, as in Simon Jackson's answer
d <- data.frame(
  Date1 = as.Date(c("2015-07-17", "2015-07-28", "2015-08-15")),
  Date2 = as.Date(c("2015-07-25", "2015-08-14", "2015-08-20"))
)
Applied to the example:
within(d, end_days<-g(Date1,Date2))
# Date1 Date2 end_days
# 1 2015-07-17 2015-07-25 3
# 2 2015-07-28 2015-08-14 4
# 3 2015-08-15 2015-08-20 2
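As a quick cross-check against the brute-force seq() count from the question (illustrative only):

g(as.Date("2015-07-17"), as.Date("2015-07-25"))
# [1] 3
sum(weekdays(seq(as.Date("2015-07-17"), as.Date("2015-07-25"), "days")) %in% c("Saturday", "Sunday"))
# [1] 3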

weekend dates within an interval R

I'm trying to identify whether or not a weekend fell within an interval of dates. I've been able to identify if a specific date is a weekend, but not when trying to look at a range of dates. Is this possible? If so, please advise. TIA.
library(lubridate)
library(chron) # for is.weekend()
start.date <- c("1/1/2017", "2/1/2017")
end.date <- c("1/21/2017", "2/11/2017")
df <- data.frame(start.date, end.date)
df$start.date <- mdy(df$start.date)
df$end.date <- mdy(df$end.date)
df$interval.date <- interval(df$start.date, df$end.date)
df$weekend.exist <- ifelse(is.weekend(df$interval.date), 1, 0)
# Error in dts - floor(dts) :
# Arithmetic operators undefined for 'Interval' and 'Interval' classes:
# convert one to numeric or a matching time-span class.
Why not use a seq of dates rather than creating the interval? Like this:
df$weekend.exist <- sapply(1:nrow(df), function(i)
  as.numeric(any(is.weekend(seq(df$start.date[i], df$end.date[i], by = "day")))))
# [1] 1 1

library(dplyr)
df %>%
  group_by(start.date, end.date) %>%
  mutate(weekend.exist = as.numeric(any(is.weekend(seq(start.date, end.date, by = "day")))))
#   start.date   end.date weekend.exist
#       <date>     <date>         <dbl>
# 1 2017-01-01 2017-01-21             1
# 2 2017-02-01 2017-02-11             1
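The f()/g() pair from the previous question already counts weekend days between two dates without expanding them into day sequences, so it can answer this directly (a sketch, assuming f and g are defined as above):

df$weekend.exist <- as.numeric(g(df$start.date, df$end.date) > 0)
df$weekend.exist
# [1] 1 1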

Find range of values in each unique day

I have the following example:
Date1 <- seq(from = as.POSIXct("2010-05-01 02:00"),
             to = as.POSIXct("2010-10-10 22:00"), by = 3600)
Dat <- data.frame(DateTime = Date1,
                  t = rnorm(length(Date1)))
I would like to find the range of values in a given day (i.e. maximum - minimum).
First, I've defined additional columns which define the unique days in terms of the date and in terms of the day of year (doy).
Dat$date <- format(Dat$DateTime, format = "%Y-%m-%d") # find the unique days
Dat$doy <- as.numeric(format(Dat$DateTime, format="%j")) # find the unique days
To then find the range I tried
by(Dat$t, Dat$doy, function(x) range(x))
but this returns the range as two values, not a single value. So my question is: how do I compute the range for each day and return it in a data.frame of the form
new_data <- data.frame(date = unique(Dat$date),
                       range = ...)
Can anyone suggest a method for doing this?
I tend to use tapply for this kind of thing. ave is also useful sometimes. Here:
> dr = tapply(Dat$t,Dat$doy,function(x){diff(range(x))})
Always check tricksy stuff:
> dr[1]
121
3.084317
> diff(range(Dat$t[Dat$doy==121]))
[1] 3.084317
Use the names attribute to get the day-of-year and the values to make a data frame:
> new_data = data.frame(date=names(dr),range=dr)
> head(new_data)
date range
121 121 3.084317
122 122 4.204053
Did you want to convert the number day-of-year back to a date object?
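If so, one small variation (a sketch, not from the original answer) is to group by the date column that was already created, so the names of the result parse straight back into Date objects:

dr_date <- tapply(Dat$t, Dat$date, function(x) diff(range(x)))
new_data <- data.frame(date = as.Date(names(dr_date)), range = dr_date, row.names = NULL)
head(new_data)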
# Use the data.table package
require(data.table)
# Set seed so data is reproducible
set.seed(42)
# Create data.table
Date1 <- seq(from = as.POSIXct("2010-05-01 02:00"), to = as.POSIXct("2010-10-10 22:00"), by = 3600)
DT <- data.table(date = as.IDate(Date1), t = rnorm(length(Date1)))
# Set key on data.table so that it is sorted by date
setkey(DT, "date")
# Make a new data.table with the required information (can be used as a data.frame)
new_data <- DT[, diff(range(t)), by = date]
# date V1
# 1: 2010-05-01 4.943101
# 2: 2010-05-02 4.309401
# 3: 2010-05-03 4.568818
# 4: 2010-05-04 2.707036
# 5: 2010-05-05 4.362990
# ---
# 159: 2010-10-06 2.659115
# 160: 2010-10-07 5.820803
# 161: 2010-10-08 4.516654
# 162: 2010-10-09 4.010017
# 163: 2010-10-10 3.311408
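If you want the new column to be called range rather than the default V1, you can name it in j (a minor variation on the call above):

new_data <- DT[, .(range = diff(range(t))), by = date]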
