I have two columns of dates. Two example dates are:
Date1= "2015-07-17"
Date2="2015-07-25"
I am trying to count the number of Saturdays and Sundays between the two dates, each of which is in its own column (5 and 7 in this example code). I need to repeat this process for each row of my dataframe. The end result will be one column that represents the number of Saturdays and Sundays within the date range defined by the two date columns.
I can get the code to work for one row:
sum(weekdays(seq(Date1[1,5], Date2[1,7], "days")) %in% c("Saturday", "Sunday") * 1)
The answer to this is 3. But if I take out the "1" in the row position of Date1 and Date2, I get this error:
Error in seq.Date(Date1[, 5], Date2[, 7], "days") :
'from' must be of length 1
How do I go line by line and get one vector that lists the number of Saturdays and Sundays between the two dates in columns 5 and 7, without using a loop? Another issue is that I have 2 million rows and am looking for something with a little more speed than a loop.
Thank you!!
The map2* functions from the purrr package are a good way to go. They take two vector inputs (e.g. two date columns) and apply a function to each pair of elements. They're pretty fast too (e.g. this previous post)!
Here's an example. Note that the _int suffix requests an integer vector back.
library(purrr)
# Example data
d <- data.frame(
  Date1 = as.Date(c("2015-07-17", "2015-07-28", "2015-08-15")),
  Date2 = as.Date(c("2015-07-25", "2015-08-14", "2015-08-20"))
)
# Wrapper function to compute number of weekend days between dates
n_weekend_days <- function(date_1, date_2) {
  sum(weekdays(seq(date_1, date_2, "days")) %in% c("Saturday", "Sunday"))
}
# Iterate row wise
map2_int(d$Date1, d$Date2, n_weekend_days)
#> [1] 3 4 2
If you want to add the results back to your original data frame, mutate() from the dplyr package can help:
library(dplyr)
d <- mutate(d, end_days = map2_int(Date1, Date2, n_weekend_days))
d
#> Date1 Date2 end_days
#> 1 2015-07-17 2015-07-25 3
#> 2 2015-07-28 2015-08-14 4
#> 3 2015-08-15 2015-08-20 2
Here is a solution that uses dplyr to clean things up. It's not too difficult to use with() to assign the columns in the data frame directly.
Essentially, use a reference date, calculate the number of full weeks (by flooring) up to each of the two dates, then take the difference between the two counts. The code does not handle cases in which the start date or end date falls on a Saturday or Sunday.
# weekdays(as.Date(0, origin = "1970-01-01")) -> "Thursday"
require(dplyr)
startDate <- as.Date(0, origin = "1970-01-01") # the Date origin, a Thursday
df <- data.frame(start = "2015-07-17", end = "2015-07-25")
df$start <- as.Date(df$start, format = "%Y-%m-%d")
df$end <- as.Date(df$end, format = "%Y-%m-%d")
# you can use with to define the columns directly instead of %>%
df <- df %>%
  mutate(originDate = startDate) %>%
  mutate(startDayDiff = as.numeric(start - originDate), endDayDiff = as.numeric(end - originDate)) %>%
  mutate(startWeekDiff = floor(startDayDiff/7), endWeekDiff = floor(endDayDiff/7)) %>%
  # the origin is a Thursday, so within a week an offset of 2 is a Saturday and 3 is a Sunday
  mutate(NumSatsStart = startWeekDiff + ifelse(startDayDiff %% 7 >= 2, 1, 0),
         NumSunsStart = startWeekDiff + ifelse(startDayDiff %% 7 >= 3, 1, 0),
         NumSatsEnd = endWeekDiff + ifelse(endDayDiff %% 7 >= 2, 1, 0),
         NumSunsEnd = endWeekDiff + ifelse(endDayDiff %% 7 >= 3, 1, 0)) %>%
  mutate(NumSats = NumSatsEnd - NumSatsStart, NumSuns = NumSunsEnd - NumSunsStart)
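As a small follow-up not in the original answer, the single column the question asks for is then just the sum of the two counts:
# combine the two counts into one weekend-day column (should give 3 for the example row)
df <- df %>% mutate(NumWeekendDays = NumSats + NumSuns)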
Dates are stored as the number of days since 1970-01-01, a Thursday. So the following gives the number of Saturdays and Sundays since that date:
f <- function(d) {
  d <- as.numeric(d)                  # days since 1970-01-01 (a Thursday)
  r <- d %% 7                         # 0 = Thursday, ..., 2 = Saturday, 3 = Sunday
  2*(d %/% 7) + (r >= 2) + (r >= 3)   # two weekend days per full week, plus the partial week
}
For the number of Saturdays or Sundays between two dates, just subtract, after decrementing the start date to have an inclusive count.
g <- function(d1, d2) f(d2) - f(d1-1)
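As a quick sanity check (not part of the original answer), applying g to the question's example pair reproduces the expected count:
g(as.Date("2015-07-17"), as.Date("2015-07-25"))
# [1] 3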
These are all vectorized functions so you can just call directly on the columns.
# Example data, as in Simon Jackson's answer
d <- data.frame(
  Date1 = as.Date(c("2015-07-17", "2015-07-28", "2015-08-15")),
  Date2 = as.Date(c("2015-07-25", "2015-08-14", "2015-08-20"))
)
Applied as follows:
within(d, end_days<-g(Date1,Date2))
# Date1 Date2 end_days
# 1 2015-07-17 2015-07-25 3
# 2 2015-07-28 2015-08-14 4
# 3 2015-08-15 2015-08-20 2
I have a large file of time-series data. The dataset covers years, in increments of 15 minutes. A small subset looks like:
uniqueid time
a 2014-04-30 23:30:00
a 2014-04-30 23:45:00
a 2014-05-01 00:00:00
a 2014-05-01 00:15:00
a 2014-05-12 13:45:00
a 2014-05-12 14:00:00
b 2014-05-12 13:45:00
b 2014-05-12 14:00:00
b 2014-05-12 14:30:00
To reproduce above:
time <- c("2014-04-30 23:30:00", "2014-04-30 23:45:00", "2014-05-01 00:00:00", "2014-05-01 00:15:00",
          "2014-05-12 13:45:00", "2014-05-12 14:00:00", "2014-05-12 13:45:00", "2014-05-12 14:00:00",
          "2014-05-12 14:30:00")
uniqueid<-c("a","a","a","a","a","a","b","b","b")
mydf<-data.frame(uniqueid,time)
My goal is to count the number of rows per unique id, per consecutive timeflow. A consecutive timeflow is when a unique id is stamped every 15 minutes in a row (such as id a, which is stamped from 2014-04-30 23:30 until 2014-05-01 00:15, hence 4 rows). When this flow of 15-minute increments is disrupted (after 2014-05-01 00:15 there is no stamp at 2014-05-01 00:30, so it is disrupted), the next timestamp should count as the start of a new consecutive timeflow, and the number of rows is again counted until that flow is disrupted. Time is POSIX.
As you can see in the example above, a consecutive timeflow may cover different days, months, or years. I have many unique ids (and, as said, a very large file), so I'm looking for an approach my computer can handle (loops probably wouldn't work).
I am looking for output something like:
uniqueid flow number_rows
a 1 4
a 2 2
b 3 2
b 4 1
I have looked into some time packages (such as lubridate), but given my limited R knowledge, I don't even know where to begin.
I hope all is clear - if not, I'd be happy to try to clarify it further. Thank you very much in advance!
Another way to do this with data.table, also using a time difference, would be to make use of data.table's internal values for the group number (.GRP) and the number of rows in each group (.N):
library(data.table)
res <- setDT(mydf)[, list(number_rows = .N, flow = .GRP),
                   by = .(uniqueid, cumsum(as.numeric(difftime(time, shift(time, 1L, type = "lag", fill = 0))) - 15))][, cumsum := NULL]
print(res)
uniqueid number_rows flow
1: a 4 1
2: a 2 2
3: b 2 3
4: b 1 4
Also, since the sample data you posted didn't quite align with the subset you showed, I have included the data I used below:
Data
time <- as.POSIXct(c("2014-04-30 23:30:00", "2014-04-30 23:45:00", "2014-05-01 00:00:00", "2014-05-01 00:15:00",
                     "2014-05-12 13:45:00", "2014-05-12 14:00:00", "2014-05-12 13:45:00", "2014-05-12 14:00:00",
                     "2014-05-12 14:30:00"))
uniqueid<-c("a","a","a","a","a","a","b","b","b")
mydf<-data.frame(uniqueid,time)
You can group by the uniqueid and the cumulative sum over the rows whose time difference from the previous row is not equal to 15 minutes; that gives the flow id, and a count of rows per group then gives you what you need.
To justify the logic: whenever the time difference within a uniqueid is not equal to 15, a new flow should start, so we label that row TRUE; combined with cumsum, this becomes a new flow id for the following consecutive rows:
library(dplyr)
# convert the time column to POSIXct class so that we can apply the diff function correctly
mydf$time <- as.POSIXct(mydf$time, format = "%Y-%m-%d %H:%M:%S")
mydf %>%
  group_by(uniqueid, flow = 1 + cumsum(c(F, diff(time) != 15))) %>%
  summarize(num_rows = n())
# Source: local data frame [4 x 3]
# Groups: uniqueid [?]
#
# uniqueid flow num_rows
# <fctr> <dbl> <int>
# 1 a 1 4
# 2 a 2 2
# 3 b 3 2
# 4 b 4 1
Base R is pretty fast. Using crude benchmarking, I found it finished in half the time of DT, and I got tired of waiting for dplyr.
# estimated size of data, years x days x hours x 15mins x uniqueids
5*365*24*4*1000 # = approx 180M
# make data with posixct and characters of 180M rows, mydf is approx 2.5GB in memory
time <- rep(as.POSIXct(c("2014-04-30 23:30:00", "2014-04-30 23:45:00", "2014-05-01 00:00:00", "2014-05-01 00:15:00",
                         "2014-05-12 13:45:00", "2014-05-12 14:00:00", "2014-05-12 13:45:00", "2014-05-12 14:00:00",
                         "2014-05-12 14:30:00")), times = 20000000)
uniqueid <- rep(as.character(c("a", "a", "a", "a", "a", "a", "b", "b", "b")), times = 20000000)
mydf<-data.frame(uniqueid,time = time)
rm(time,uniqueid);gc()
Base R:
# assumes that uniqueid's are in groups and in order, and there won't be a followed by b that have the 15 minute "flow"
starttime <- Sys.time()
# find failed flows
mydf$diff <- c(0,diff(mydf$time))
mydf$flowstop <- mydf$diff != 15
# give each flow an id
mydf$flowid <- cumsum(mydf$flowstop)
# clean up vars
mydf$time <- mydf$diff <- mydf$flowstop <- NULL
# find flow length
mydfrle <- rle(mydf$flowid)
# get uniqueid/flowid pairs (unique() is too slow)
mydf <- mydf[!duplicated(mydf$flowid), ]
# append rle and remove separate var
mydf$number_rows <- mydfrle$lengths
rm(mydfrle)
print(Sys.time()-starttime)
# Time difference of 30.39437 secs
data.table:
library(data.table)
starttime <- Sys.time()
res <- setDT(mydf)[, list(number_rows = .N, flow = .GRP),
                   by = .(uniqueid, cumsum(as.numeric(difftime(time, shift(time, 1L, type = "lag", fill = 0))) - 15))][, cumsum := NULL]
print(Sys.time()-starttime)
# Time difference of 57.08156 secs
dplyr:
library(dplyr)
# convert the time column to POSIXct class so that we can apply the diff function correctly
starttime <- Sys.time()
mydf %>%
  group_by(uniqueid, flow = 1 + cumsum(c(F, diff(time) != 15))) %>%
  summarize(num_rows = n())
print(Sys.time()-starttime)
# too long, did not finish after a few minutes
I think the assumption of uniqueids and times being in order is huge, and the other solutions might be able to take advantage of that better. order() is easy enough to do first; a sketch follows below.
I'm not sure about the impact on memory, or of data sets that aren't so simple. It should be easy enough to break the data into chunks and process them if memory is an issue. It definitely takes more code in base R.
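If your data isn't already ordered, a minimal sketch of sorting it first (using the column names from the question) would be:
# base R: sort by id, then time, before running the solution above
mydf <- mydf[order(mydf$uniqueid, mydf$time), ]
# or, by reference with data.table:
# setorder(setDT(mydf), uniqueid, time)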
Having both "id" and "time" columns ordered, we could build a single grouping vector by flagging the rows where either "id" changes or the "time" gap is > 15 minutes.
With:
id = as.character(mydf$uniqueid)
tm = mydf$time
find where "id":
id_gr = c(TRUE, id[-1] != id[-length(id)])
and "time":
tm_gr = c(0, difftime(tm[-1], tm[-length(tm)], unit = "mins")) > 15
change and combine them in:
gr = id_gr | tm_gr
which shows wherever either "id" changed or "time" > 15.
And to get the result:
tab = tabulate(cumsum(gr)) ## basically, the only operation per group -- 'n by group'
data.frame(id = id[gr], flow = seq_along(tab), n = tab)
# id flow n
#1 a 1 4
#2 a 2 2
#3 b 3 2
#4 b 4 1
On a larger scale:
set.seed(1821); nid = 1e4
dat = replicate(nid, as.POSIXct("2016-07-07 12:00:00 EEST") +
                  cumsum(sample(c(1, 5, 10, 15, 20, 30, 45, 60, 90, 120, 150, 200, 250, 300),
                                sample(5e2:1e3, 1), TRUE) * 60),
                simplify = FALSE)
names(dat) = make.unique(rep_len(letters, nid))
dat = data.frame(id = rep(names(dat), lengths(dat)), time = do.call(c, dat))
system.time({
  id = as.character(dat$id); tm = dat$time
  id_gr = c(TRUE, id[-1] != id[-length(id)])
  tm_gr = c(0, difftime(tm[-1], tm[-length(tm)], unit = "mins")) > 15
  gr = id_gr | tm_gr
  tab = tabulate(cumsum(gr))
  ans1 = data.frame(id = id[gr], flow = seq_along(tab), n = tab)
})
# user system elapsed
# 1.44 0.19 1.66
For comparison, I've included MikeyMike's answer:
library(data.table)
dat2 = copy(dat)
system.time({
  ans2 = setDT(dat2)[, list(flow = .GRP, n = .N),
                     by = .(id, cumsum(as.numeric(difftime(time,
                                                           shift(time, 1L, type = "lag", fill = 0),
                                                           unit = "mins")) > 15))][, cumsum := NULL]
})
# user system elapsed
# 3.95 0.22 4.26
identical(as.data.table(ans1), ans2)
#[1] TRUE
I would like to take the mean over a 7-day rolling window, in 1-day increments, of data that is collected at 30-minute intervals.
I have tried using data.table with a conditional statement in by, with no success. Any guidance would be greatly appreciated.
# packages
library(data.table)
library(lubridate)
# Set seed to have reproducible sampling
set.seed(42)
# Create some Data
start = ymd_hms("2014-01-01 00:00:00")
end = ymd_hms("2014-12-31 23:59:59")
# Create data with 30 minute intervals.
dat <- data.table(timestamp = seq(start, end, by = "30 min"),
                  sample1 = sample(1:20, 17520, replace = TRUE))
# Create date variable for merging datasets.
dat[, date := as.Date(timestamp)]
# Create data for a 7 day moving window with one day increments.
dat2 <- data.table(start = seq(start, end, by = "1 day"),
                   end = seq(start + days(7), end + days(7), by = "1 day"))
# Create date variable for merging datasets.
dat2[, date := as.Date(start)]
# Merge datasets.
dat <- merge(dat, dat2, by="date")
# Tried
dat[, .(sample.mean = mean(sample1)), by = .(timestamp >= start & timestamp < end)]
# timestamp sample.mean
# 1: TRUE 10.46638
dat[, .(sample.mean = mean(sample1)), by = .(timestamp %in% c(start:end))]
# timestamp sample.mean
# 1: TRUE 10.40059
# 2: FALSE 10.46767
# Warning messages:
# 1: In start:end :
# numerical expression has 17520 elements: only the first used
# 2: In start:end :
# numerical expression has 17520 elements: only the first used
dat[, .(sample.mean = mean(sample1)), by = .(timestamp %between% c(start, end))]
# timestamp sample.mean
# 1: TRUE 19.00000
# 2: FALSE 10.46589
I'm not 100% sure I understand your exact parameters, but here's the basic approach:
setkey(dat, date)
# pull the 7 previous days
dat[ , dat[.(seq(.BY$date - 7L,
                 .BY$date, by = "day")),
           # nomatch = 0L will exclude any requested dates outside the interval
           mean(sample1), nomatch = 0L], by = date]
# date V1
# 1: 2014-01-01 12.31250
# 2: 2014-01-02 10.94792
# 3: 2014-01-03 11.27083
# 4: 2014-01-04 11.10417
# 5: 2014-01-05 10.79167
# ---
# 361: 2014-12-27 10.50260
# 362: 2014-12-28 10.52344
# 363: 2014-12-29 10.05990
# 364: 2014-12-30 10.03906
# 365: 2014-12-31 10.38542
Some possible tweaks (a combined sketch follows the list):
Change 7L to whatever window you'd like; use positive if you want forward-looking averages.
If you want to go by timestamp, you'll have to adjust the 7L to match whatever units (seconds/minutes/hours/etc.).
The extreme points of the interval are not technically correct since the window is shorter than requested; drop the nomatch argument and these points will return NA.
Use .(var = mean(sample1)) to name the output column var.
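A quick sketch combining a couple of these tweaks (untested; it reuses the keyed dat from above and a hypothetical 3-day window):
# 3-day backward window, named output column, no nomatch so edge dates come back NA
dat[ , dat[.(seq(.BY$date - 3L, .BY$date, by = "day")),
           .(var = mean(sample1))], by = date]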
Here's one approach:
library(zoo)
daymeans = dat[, mean(sample1), by=date][, rmean := rollmean(V1, 7, fill=NA)]
dat[daymeans, rmean := i.rmean, on="date"]
This assumes that your data is already sorted by date; if not, use keyby=date instead of by=date. If you don't want to juggle intermediate objects, there is a one-liner:
# Michael Chirico's suggestion from the comments
dat[dat[, mean(sample1), by=date][, rmean := rollmean(V1, 7, fill=NA)], rmean := i.rmean, on = "date"]
You may need to tweak the arguments to rollmean to fit your particular definition of the window. @eddi suggested that runmean from the caTools package is typically faster than zoo's rollmean, so it is probably also worth a look.
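If you want to try that route, here is an untested sketch; runmean's endrule = "NA" should roughly mimic rollmean's fill = NA padding at the edges:
library(caTools)
# same structure as above, with runmean swapped in for rollmean
daymeans = dat[, mean(sample1), by = date][, rmean := runmean(V1, 7, endrule = "NA")]
dat[daymeans, rmean := i.rmean, on = "date"]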
Crude benchmark with the OP's example data:
dat2 = copy(dat)
# Michael's answer
system.time({
  setkey(dat, date)
  dat[ , dat[.(seq(.BY$date - 7L,
                   .BY$date, by = "day")),
             mean(sample1), nomatch = 0L], by = date]
})
user system elapsed
0.33 0.00 0.35
# this answer
system.time({
  daymeans = dat2[, mean(sample1), by=date][, rmean := rollmean(V1, 7, fill=NA)]
  dat2[daymeans, rmean := i.rmean, on="date"]
})
user system elapsed
0 0 0
Why it's faster: here we compute 365 means of 48 numbers and then a rolling mean over 365 values, which is less computationally costly than making 365 merges to gather 48*7 numbers each and then taking the mean of those.
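To put rough, back-of-the-envelope numbers on that claim (illustrative only, not a profile):
# values touched by the group-then-roll approach vs. the join-per-day approach
365 * 48        # daily means: 17,520 values read once
365 * 7         # rolling mean over the 365 daily values: 2,555 values
365 * 48 * 7    # 365 separate 7-day joins, each averaging 48*7 values: 122,640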
I have time series data with two columns: 1) a POSIX date-time column at 30-minute intervals and 2) a value for each interval, as shown below:
read_date_time int_val
2013-01-15 15:00:00 2.3
2013-01-15 15:30:00 2.4
I've written a function that pivots the data.table so that there are 48 columns, one for each time interval, with each row representing a day.
read_date 00:00 00:30 01:00 01:30 ...
2013-01-15 1.3 1.4 1.2 1.5 ...
The function involves creating two new columns (pure_date and interval), which are used as IDs in the reshape step. However, I'm finding that the new columns are also added to the original table and that the original read_date_time column is removed.
int_val pure_date interval
6.829986e-05 2013-08-31 00:00:00
6.887250e-05 2013-08-31 00:30:00
This causes numerous problems downstream, as the original data set is reused in other functions. I'm aware that I could probably bypass some of these problems by using data.frame operations instead; however, as I'm handling very large quantities of data and efficiency is key, I really need a data.table solution.
What am I doing wrong?
Code for replication....
require(data.table)
require(reshape)
require(stringr)
# Create time_array for example
set.seed(1L) ## for reproducibility
dt_format = "%Y-%m-%d %H:%M"
time_seq <- seq.POSIXt(
  as.POSIXct("2012-01-01 00:00:00", format = dt_format),
  as.POSIXct("2013-12-31 00:00:00", format = dt_format),
  by = "30 mins")
values <- runif(NROW(time_seq),0,1)
combined_data_set <- data.table(read_date_time = time_seq, int_val = values)
> head(combined_data_set) # Format wanted
# Define Pivoting Function
pivot_data <- function(A) {
  con_data <- A
  con_data[, pure_date := as.Date(read_date_time)]
  con_data[, interval := str_sub(as.character(read_date_time), -8, -1)]
  con_data[, read_date_time := NULL]
  con_data <- data.table(read_date = as.character(con_data$pure_date),
                         interval = con_data$interval,
                         int_val = con_data$int_val)
  pivoted <- recast(con_data, read_date ~ interval,
                    id.var = c("read_date", "interval"))
  return(pivoted)
}
# Apply to data set
pivoted_output <- pivot_data(combined_data_set)
# Original data has been altered, what's happened!!!!!
> head(combined_data_set)
pivot_data <- function(A) {
  con_data <- copy(A)
  con_data[, pure_date := as.Date(read_date_time)]
  con_data[, interval := str_sub(as.character(read_date_time), -8, -1)]
  con_data[, read_date_time := NULL]
  con_data <- data.table(read_date = as.character(con_data$pure_date),
                         interval = con_data$interval,
                         int_val = con_data$int_val)
  pivoted <- recast(con_data, read_date ~ interval,
                    id.var = c("read_date", "interval"))
  return(pivoted)
}
As mentioned, just a school-boy error: copy(A) does the job.
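For anyone new to data.table's reference semantics, here is a minimal illustration (not from the original post) of why copy() matters inside a function:
library(data.table)
DT <- data.table(x = 1:3)
f_ref  <- function(A) A[, y := x * 2]         # := modifies the caller's table by reference
f_copy <- function(A) copy(A)[, y := x * 2]   # copy() leaves the caller's table untouched
f_copy(DT); names(DT)   # "x"       -- original unchanged
f_ref(DT);  names(DT)   # "x" "y"   -- original gained a column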
I'm a new user of R and I'm a little bit stuck. My data looks like this:
dates temp
01/31/2011 40
01/30/2011 34
01/29/2011 30
01/28/2011 52
01/27/2011 39
01/26/2011 37
...
01/01/2011 31
I want to take only the stretches where temp is under 40 degrees, together with their beginning and end dates and how many days each stretch lasts, for example:
from to days
01/29/2011 01/30/2011 2
01/26/2011 01/27/2011 2
I tried with difftime but it didn't work; maybe it will with a function.
Any help would be appreciated.
I'd do something like this. I'll use data.table here.
df <- read.table(header=TRUE, text="dates temp
01/31/2011 40
01/30/2011 34
01/29/2011 30
01/28/2011 52
01/27/2011 39
01/26/2011 37", stringsAsFactors=FALSE)
require(data.table)
dt <- data.table(df)
dt <- dt[, `:=`(date.form = as.Date(dates, format = "%m/%d/%Y"),
                id = cumsum(as.numeric(temp >= 40)))][temp < 40]
dt[, list(from=min(date.form), to=max(date.form), count=.N), by=id]
# id from to count
# 1: 1 2011-01-29 2011-01-30 2
# 2: 2 2011-01-26 2011-01-27 2
The idea is to first create a column (date.form) with the dates column converted to Date format. Then create another column, id, that marks the positions where temp >= 40 and uses their cumulative sum to group the values that fall between two temp >= 40 entries. That is, if you have c(40, 34, 30, 52, 39, 37), then you'd want c(1,1,1,2,2,2): everything between two values >= 40 must belong to the same group (34, 30 -> 1 and 39, 37 -> 2). After doing this, I remove the temp >= 40 entries.
Then you can split by this group and take the min, the max, and the length (which data.table stores by default in .N).
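To see the grouping trick on its own, using just the toy vector from the explanation:
temp <- c(40, 34, 30, 52, 39, 37)
cumsum(temp >= 40)
# [1] 1 1 1 2 2 2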
Not as elegant as Arun's data.table answer, but here is a base R solution:
DF <- read.table(text = "dates temp\n01/31/2011 40\n01/30/2011 34\n01/29/2011 30\n01/28/2011 52\n01/27/2011 39\n01/26/2011 37",
header = TRUE, stringsAsFactors = FALSE)
DF$dates <- as.POSIXct(DF$dates, format = "%m/%d/%Y")
DF <- DF[order(DF$dates), ]
DF$ID <- cumsum(DF$temp >= 40)
DF2 <- DF[DF$temp < 40, ]
# Explanation split : split DF2 by DF2$ID
# lapply : apply function on each list element given by split
# rbind : bind all the data together
do.call(rbind, lapply(split(DF2, DF2$ID), function(x)
  data.frame(from = min(x$dates),
             to = max(x$dates),
             count = length(x$dates))))
## from to count
## 0 2011-01-26 2011-01-27 2
## 1 2011-01-29 2011-01-30 2
First read in the data. read.zoo handles many of the details all in one line including reordering the data to be ascending and converting the dates to "Date" class. If z is the resulting zoo object then coredata(z) gives the temperatures and time(z) gives the dates.
Lines <- "
dates temp
01/31/2011 40
01/30/2011 34
01/29/2011 30
01/28/2011 52
01/27/2011 39
01/26/2011 37
"
library(zoo)
z <- read.zoo(text = Lines, header = TRUE, format = "%m/%d/%Y")
The crux of all this is the use of rle which computes lengths and values from which we can derive all quantities:
tt <- time(z)
with(rle(coredata(z) < 40), {
  to <- cumsum(lengths)[values]
  lengths <- lengths[values]
  from <- to - lengths + 1
  data.frame(from = tt[from], to = tt[to], days = lengths)
})
Using the first 6 lines of the input data shown we get:
from to days
1 2011-01-26 2011-01-27 2
2 2011-01-29 2011-01-30 2