I have a question about a faster way to compute counts over date intervals.
My input:
A data frame: one row per (person, period) pair. Each row has a person's ID, a start date and an end date.
A period of time: every date, day by day, over two years.
What I am trying to do is calculate the number of people present, date by date.
I have code which works, but it is not efficient enough with a large dataset (~100k to 1M rows).
The problem is that, since I have two years of dates, my code repeats the following steps 730 times (365 x 2):
Filter the dataset to the rows whose start and end dates enclose the specific date
Calculate the number of unique IDs in the filtered dataset
These operations are very slow or impossible with a large dataset.
I am wondering whether there is a better and faster way to do these operations, for example with aggregation or another technique.
An example with a short input and output :
library(lubridate)
library(dplyr)
# Vector of date
vector_day <- seq(ymd('2017-01-01'), ymd('2018-12-30'), by= "days")
# Input Data
df <- data.frame(
id_people = c(1, 2, 3, 4, 1),
StartDate = c(as.Date("2018-11-01"), as.Date("2018-11-03"),as.Date("2018-12-01"),as.Date("2018-11-15") ,as.Date("2018-11-15")),
EndDate = c(as.Date("2018-11-10"), as.Date("2018-12-04"),as.Date("2018-12-10"),as.Date("2018-11-17"), as.Date("2018-11-23")),
Gender = c("F", "F", "M", "F", "F"))
# df_f (used below) was missing from the example; a plausible definition given the Gender column
df_f <- df %>% filter(Gender == "F")
# Function to compute the number of people present on a specific date
compute_nb_f_by_day <- function(date) {
cond1 <- df_f$StartDate <= date
cond2 <- df_f$EndDate > date
cond <- cond1 & cond2
res <- length(unique(df_f[cond,]$id_people))
return(res)
}
# An example of how the function works for one date
compute_nb_f_by_day(as.Date("2018-12-01"))
# Computation for all the dates
nb_f_by_day <- cbind(
  data.frame(vector_day),
  data.frame(nb_f = sapply(vector_day, compute_nb_f_by_day)))
Thanks.
This solution benchmarked significantly faster than your code on the given example (your code: 0.132 s; this code: 0.032 s on my system). Give it a try and see whether it also brings a significant improvement on the large dataset!
#-- Create the 'Interval'
df2 <- df %>%
mutate(DateInterval = StartDate %--% EndDate)
#-- Create a result df instead of using cbind (more efficient)
result_df <- data.frame(Day = vector_day, Nb = NA)
#-- Get intervals that contain the days in vector_day
result_df$Nb <- sapply(vector_day, function(day) {sum(day %within% df2$DateInterval)})
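If this is still too slow at 100k to 1M rows, a data.table non-equi join may scale better, since it avoids looping over the 730 days in R. A minimal sketch, assuming the same df and vector_day as above (the Nb column name is mine; the two join columns in the result both hold the day):
library(data.table)
dt <- as.data.table(df)
days <- data.table(Day = vector_day)
# one aggregated row per day: count distinct people whose interval covers it
nb <- dt[days, on = .(StartDate <= Day, EndDate > Day),
         .(Nb = uniqueN(id_people, na.rm = TRUE)), by = .EACHI]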
I have a dataset A with a place, starting date and finish date. On the other hand, I have a dataset B also with a place, a date and number of cars.
library(data.table)
A <- data.table(Place = c(rep(c("Place_1","Place_2"), each = 20)),
Start_date = as.Date("2010-01-15"),
Finish_date = as.Date(rep(c("2011-03-01","2012-04-30","2012-01-20","2011-04-05"), each = 10)))
set.seed(1001)
B <- data.table(Date = rep(seq.Date(from = as.Date("2010-01-01"), to = as.Date("2013-01-01"), by="day"), 2),
Place = rep(c("Place_1","Place_2"),each = 1097),
Cars = round(runif(2194, 0, 10), 0))
I need to calculate in dataset A a new column (total of cars) which is the sum of cars in dataset B; this sum of cars must be for a specific place and within a certain period of time.
This is easily done with a for-loop statement.
for (i in 1:nrow(A)) {
A$Tcars[i] <- sum(B[Place == A$Place[i] & Date > A$Start_date[i] & Date < A$Finish_date[i]]$Cars)
}
But my real dataset has 30,000 rows, and the loop option is inefficient and time consuming. So I am looking for a vectorized way of doing this. I have tried the following code, but it does not work:
A$Tcars<-sum(B[Place == A$Place & Date > A$Start_date & Date < A$Finish_date]$Cars)
You can use a non-equi join to update the table:
library(data.table)
A[, n := B[.SD, on=.(Place, Date > Start_date, Date < Finish_date),
sum(Cars), by=.EACHI]$V1]
If you look at ?data.table and the other introductory materials listed when you first type library(data.table), you'll get some intuition for :=, on=, by=, etc.
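If you want to see what the join produces before assigning it back with :=, you can run the inner expression on its own; by = .EACHI computes the aggregate once per row of A:
# one row per row of A, with the per-interval car total in Tcars
B[A, on = .(Place, Date > Start_date, Date < Finish_date),
  .(Tcars = sum(Cars)), by = .EACHI]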
Other questions have centered around having a start and end date (see the following for examples):
Given start date and end date, reshape/expand data for each day between (each day on a row)
Expand rows by date range using start and end date
My question is different in that I only have one date column and I would like to convert the unequal date ranges to daily counts. This specific example deals with the number of workers on a job site at one time. Different crews of people come on different dates.
A brief data frame provided is as follows:
dd <- data.frame(date=as.Date(c("1999-03-22","1999-03-29","1999-04-08")),work=c(43,95,92),cumwork=c(43,138,230))
I would like the data to look like this:
dw <- data.frame(date=c(seq(as.Date("1999-03-22"),as.Date("1999-04-10"),by= "day")),
work=c(rep(43,7),rep(95,10),rep(92,3)),
cumwork=c(rep(43,7),rep(138,10),rep(230,3)))
I have been stuck on this for some time. Any help would be appreciated!
UPDATE (7/5/2017): As pointed out by @Scarabee, the dates in the dataframe 'dd' should be in Date format. I have updated the code to reflect this.
A possible way:
First, create the sequence of dates you're interested in as a one-column dataframe:
v <- data.frame(date = seq(min(dd$date), as.Date("1999-04-10"), by="day"))
Next, join with your original dataframe and fill the missing values, for instance using dplyr and zoo:
library(dplyr)
library(zoo)
v %>%
left_join(dd, by = "date") %>%
na.locf
NB: I suppose that your dataframe dd actually contains dates (and not factors).
A similar solution, with base R (and the zoo package):
dd$date <- as.Date(as.character(dd$date))
my.seq <- data.frame(date=seq.Date(from=range(dd$date)[1], to=range(dd$date)[2], by="day"))
output <- merge(my.seq, dd, all.x=TRUE)
output <- zoo::na.locf(output)
You first have to transform your date column into Date format. Then separately create a vector of complete dates and merge it with the original data. Finally, run a "last observation carried forward" algorithm.
Here is a really fast pure base R solution:
ExpandDates <- function(df, lastColRepeat) {
myDiff <- diff(df$date)
dfOut <- data.frame(df$date[1] + 0:(sum(myDiff) + lastColRepeat - 1L),
stringsAsFactors=FALSE)
myDiff <- c(myDiff, lastColRepeat)
for (i in 2:3) {dfOut[,i] <- rep(df[ ,i], times = myDiff)}
names(dfOut) <- names(df)
dfOut
}
The last argument determines the number of times the last value should be repeated. As it stands, there is nothing in the original data.frame that would give this value. I am also assuming that the "date" field is actually a Date, as pointed out by @Scarabee.
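For the dd in the question, passing 3 as that last argument (1999-04-08 through 1999-04-10 is three days) should reproduce dw above:
ExpandDates(dd, 3)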
Here is some test data:
set.seed(123)
workVec <- sample(5000, 3000)
testDF <- data.frame(date = as.Date(sort(sample(12000, 3000)),
origin = "1970-01-01"), work = workVec,
cumwork = cumsum(workVec))
DplyrTest <- function(dd) { ## from @Scarabee
v <- data.frame(date = seq(min(dd$date), max(dd$date), by="day"))
v %>%
left_join(dd, by = "date") %>%
na.locf
}
a <- ExpandDates(testDF, 1)
b <- DplyrTest(testDF)
Test for equality:
identical(a$cumwork, as.integer(b$cumwork))
[1] TRUE
identical(a$work, as.integer(b$work))
[1] TRUE
identical(a$date, as.Date(b$date))
[1] TRUE
Benchmarks:
library(microbenchmark)
microbenchmark(DplyrTest(testDF), ExpandDates(testDF,1))
Unit: milliseconds
expr min lq mean median uq max neval cld
DplyrTest(testDF) 80.909303 84.337006 91.315057 86.320883 88.818739 173.69395 100 b
ExpandDates(testDF, 1) 1.122384 1.208184 2.521693 1.355564 1.486317 72.23444 100 a
I have a table including a time series of daily values (value), Date and a column with "0s". Here are the variables:
value <- c(37,19.75,19.5,14.5,24.75,25,25.5,19.75,19.75,14.25,21.25,21.75,17.5,16.25,14.5,
14.5,14.75,9.5,11.75,15.25,14.25,16.5,13.5,18.25,13.5,11.25,10.75,12,8.5,
9.75,14.75)
Date <- c("1997-05-01","1997-05-02","1997-05-03","1997-05-04","1997-05-05",
"1997-05-06","1997-05-07","1997-05-08","1997-05-09","1997-05-10",
"1997-05-11","1997-05-12","1997-05-13","1997-05-14","1997-05-15",
"1997-05-16","1997-05-17","1997-05-18","1997-05-19","1997-05-20",
"1997-05-21","1997-05-22","1997-05-23","1997-05-24","1997-05-25",
"1997-05-26","1997-05-27","1997-05-28","1997-05-29","1997-05-30",
"1997-05-31")
ncol <- c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
data <- data.frame(value, Date, ncol)
Date is formatted as Date using the as.Date function. Now I want to add "1" to some of the values in a new column "newcol" (the column of 0s) on five specific days, e.g. "1997.05.05", "1997.05.11", "1997.05.14", "1997.05.18" and "1997.05.25" in the time series.
I created this code, but it works for a single date only:
x <- 1
i <- which(format(data$Date, "%Y.%m.%d") == "1997.05.05")
data$newcol[i] <- data$newcol[i] + x
What is the best way to do that?
Then I would like to count, for each flagged date (newcol = 1), the number of times that "value" is greater than 20 over the previous 5 days. For example, for the date 1997.05.25, how many times does the value exceed 20 between 1997.05.21 and 1997.05.25?
This answers the 1st part of your question:
library(data.table)
setDT(data)[ Date %in% c("1997-05-05","1997-05-11","1997-05-14","1997-05-18","1997-05-25"), newcol := ncol+1 ]
# or perhaps better:
setDT(data)[, newcol := ifelse(Date %in% c("1997-05-05","1997-05-11","1997-05-14","1997-05-18","1997-05-25"), ncol+1, 0) ]
With base R this can be done
transform(data, newcol = as.integer(as.character(Date) %in%
c("1997-05-05","1997-05-11","1997-05-14","1997-05-18","1997-05-25")))
I'm currently struggling with a beginner's issue regarding the calculation of a time difference between two events.
I want to take a column containing date and time (both values in one column) and calculate the time difference to the previous/next row with the same ID (A or B in this example).
ID = c("A", "A", "B", "B")
time = c("08.09.2014 10:34","12.09.2014 09:33","13.08.2014 15:52","11.09.2014 02:30")
d = data.frame(ID,time)
My desired output is in the format Hours:Minutes
time difference = c("94:59","94:59","682:38","682:38")
The format Days:Hours:Minutes or anything similar would also work, as long as it could be conveniently implemented. I am flexible regarding the format of the output, the above is just an idea that crossed my mind.
For each single ID I always have two rows (in the example, 2 x A and 2 x B). I do not have a convincing idea of how to avoid the repetition of the difference.
I've tried some examples before, which I found on stackoverflow. Most of them used POSIXt and strptime. However, I didn't manage to apply those ideas to my data set.
Here's my attempt using dplyr
library(dplyr)
d %>%
mutate(time = as.POSIXct(time, format = "%d.%m.%Y %H:%M")) %>%
group_by(ID) %>%
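  # diff(time)*24 is the gap in hours: the digits before the decimal point are
  # whole hours, and the fractional part times 60 gives the minutes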
mutate(diff = paste0(gsub("[.].*", "", diff(time)*24), ":",
round(as.numeric(gsub(".*[.]", ".", diff(time)*24))*60)))
# Source: local data frame [4 x 3]
# Groups: ID
#
# ID time diff
# 1 A 2014-09-08 10:34:00 94:59
# 2 A 2014-09-12 09:33:00 94:59
# 3 B 2014-08-13 15:52:00 682:38
# 4 B 2014-09-11 02:30:00 682:38
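A sketch of an equivalent that avoids the string surgery, using difftime (this assumes exactly two rows per ID, as stated in the question):
library(dplyr)
d %>%
  mutate(time = as.POSIXct(time, format = "%d.%m.%Y %H:%M")) %>%
  group_by(ID) %>%
  mutate(hrs  = abs(as.numeric(difftime(max(time), min(time), units = "hours"))),
         diff = sprintf("%d:%02d", hrs %/% 1, round(hrs %% 1 * 60))) %>%
  select(-hrs)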
A very (to me) hack-ish base solution:
ID <- c("A", "A", "B", "B")
time <- c("08.09.2014 10:34", "12.09.2014 09:33", "13.08.2014 15:52","11.09.2014 02:30")
d <- data.frame(ID, time)
d$time <- as.POSIXct(d$time, format="%d.%m.%Y %H:%M")
unlist(unname(lapply(split(d, d$ID), function(d) {
sapply(abs(diff(c(d$time[2], d$time))), function(x) {
sprintf("%s:%s", round(((x*24)%/%1)), round(((x*24)%%1 *60)))
})
})))
## [1] "94:59" "94:59" "682:38" "682:38"
I have to believe this function exists somewhere already, tho.
Similar to the attempts of David and hrmbrmstr, I found that this solution using difftime works.
I use a rowShift function I found on Stack Overflow:
rowShift <- function(x, shiftLen = 1L) {
r <- (1L + shiftLen):(length(x) + shiftLen)
r[r<1] <- NA
return(x[r])
}
d$time.c <- as.POSIXct(d$time, format = "%d.%m.%Y %H:%M")
d$time.prev <- rowShift(d$time.c,-1)
d$diff <- difftime(d$time.c,d$time.prev, units="hours")
Every other row of d$diff has a positive/negative value in the results. I then remove all the rows with negative values and am left with the difference between the first and the last time for every ID.
I wish to create 24 hourly data frames in which each data.frame contains hourly demand for a product as one column, and the next eight columns contain hourly temperatures. For example, the data.frame for 8AM will contain a column for demand at 8AM, then eight columns for temperature, from the current hour back through the seven previous hours. The additional complication is that for hours early in the day, e.g. 4AM, I have to reach back into yesterday's temperatures. I am hitting my head against the wall trying to figure out how to do this with apply or plyr, or a vectorized function.
demand8AM Temp8AM Temp7AM Temp6AM...Temp1AM
Demand4AM Temp4AM Temp3AM Temp2AM Temp1AM Temp12AM Temp11pm(Lag) Temp10pm(Lag)
In my code Hours are numbers; 1 is 12AM etc.
Here is some simple code I created to create the dataset I am dealing with.
#Creating some Fake Data
require(plyr)
# setting up some fake data
set.seed(31)
foo <- function(myHour, myDate){
rlnorm(1, meanlog=0,sdlog=1)*(myHour) + (150*myDate)
}
Hour <- 1:24
Day <-1:90
dates <-seq(as.Date("2012-01-01"), as.Date("2012-3-30"), by = "day")
myData <- expand.grid( Day, Hour)
names(myData) <- c("Date","Hour")
myData$Temperature <- apply(myData, 1, function(x) foo(x[2], x[1]))
myData$Date <-dates
myData$Demand <-(rnorm(1,mean = 0, sd=1)+.75*myData$Temperature )
## ok, done with the fake data generation.
It looks as though you could benefit from utilizing a time series. Here's my interpretation of what you want (I used the "mean" function in rollapply), not what you asked for. I recommend you read over the xts and zoo packages.
library(xts)
#create dummy time vector
time_index <- seq(from = as.POSIXct("2012-05-15 07:00"),
to = as.POSIXct("2012-05-17 18:00"), by = "hour")
#create dummy demand and temp.C
info <- data.frame(demand = sample(1:length(time_index), replace = T),
temp.C = sample (1:10))
#turn demand + temp.C into time series
eventdata <- xts(info, order.by = time_index)
x2 <- eventdata$temp.C
for (i in 1:8) {x2 <- cbind(x2, lag(eventdata$temp.C, i))}
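One possible way to finish from here (my assumption, not part of the original answer): name the lag columns, then pull out the rows for a given hour, e.g. 8AM, and attach demand alongside them.
#name the current temperature and its 8 lags
colnames(x2) <- c("temp", paste0("temp_lag", 1:8))
#rows whose timestamp is 08:00, with demand next to the lagged temperatures
df8am <- merge(eventdata[.indexhour(eventdata) == 8, "demand"],
               x2[.indexhour(x2) == 8])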