I have a column with dates that are formatted like this:
yyyymm (e.g. 201809)
I want them to be formatted like this:
mm.yyyy (e.g. 09.2018)
I tried:
FF5factors$date <- strptime(FF5factors$date, format= "%Y%m")
format(FF5factors$date, format="%m.%Y")
But it only returns NA values.
What about:
d <- '201809'
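# append a day so the date is complete; note that %M means minutes, %m means month
format(as.Date(paste0(d, '01'), '%Y%m%d'), '%m.%Y')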
[1] "09.2018"
Here are some alternatives. The question did not provide date in reproducible form, so we assume the first line below, although the first 4 alternatives will also work with date <- "201809" and with date <- factor(201809).
date <- 201809
# 1
sub("(....)(..)", "\\2.\\1", date)
## [1] "09.2018"
# 2
library(zoo)
format(as.yearmon(format(date), "%Y%m"), "%m.%Y")
## [1] "09.2018"
# 3
paste(substr(date, 5, 6), substr(date, 1, 4), sep = ".")
## [1] "09.2018"
# 4
format(as.Date(paste0(date, "01"), "%Y%m%d"), "%m.%Y")
## [1] "09.2018"
# 5
sprintf("%02d.%d", date %% 100, date %/%100)
## [1] "09.2018"
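Applied to a whole column, alternative 4 might look like the sketch below. The FF5factors data frame here is a made-up stand-in for the question's data (only the column name date comes from the question), and date_fmt is a hypothetical name for the result. It assumes the NA values in the question arise because "%Y%m" carries no day, which is why a dummy day is appended before parsing.

```r
# Made-up stand-in for the question's data frame (values are illustrative)
FF5factors <- data.frame(date = c(201807, 201808, 201809))

# Append a dummy day "01" so that as.Date() receives a complete date
parsed <- as.Date(paste0(FF5factors$date, "01"), format = "%Y%m%d")

# Reformat as mm.yyyy; the result is a character vector, not a Date
FF5factors$date_fmt <- format(parsed, "%m.%Y")
print(FF5factors)
```

Keeping the parsed Date values around (rather than only the formatted strings) is usually more convenient if the column is later used for sorting or merging.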
I am trying to find the difference between two dates in hours, and for the time differences that occur over the span of more than one day I am getting really outrageous and incorrect numbers.
Here is an example of the data:
Observation Status DateTime
1 Active 2016-11-04 22:32:49
2 Inactive 2016-11-05 08:30:56
I am running this command:
getDiff <- function(x) {
difftime(shift(x, fill = NA, type = "lead"), x, units = "hours")
}
diff_result <- dataframe[, time.diff := ifelse(Status == "Active",
getDiff(DateTime), NA)]
And I get the following output:
Observation Status DateTime Time.diff
1 Active 2016-11-04 22:32:49 8757.884
2 Inactive 2016-11-05 08:30:56
This command works for all other differences that do not happen on separate days.
The correct answer should be around 10 hours, not over 8000.
Also,
> class(DataFrame$DateTime)
[1] "POSIXct" "POSIXt"
Thank you in advance!
It seems the OP has not converted DateTime correctly. The 8757 hours are equivalent to about one year, so it is possible that the DateTime values are wrongly formatted.
The result looks fine using OP's data at my end.
library(data.table)
getDiff <- function(x) {
difftime(shift(x, fill = NA, type = "lead"), x, units = "hours")
}
setDT(df)
diff_result <- df[, time.diff := ifelse(Status == "Active",
getDiff(DateTime), NA)]
diff_result
# Observation Status DateTime time.diff
# 1: 1 Active 2016-11-04 22:32:49 9.968611
# 2: 2 Inactive 2016-11-05 08:30:56 NA
Data:
df <- read.table(text =
"Observation Status DateTime
1 Active '2016-11-04 22:32:49'
2 Inactive '2016-11-05 08:30:56'",
header = TRUE, stringsAsFactors = FALSE)
df$DateTime = as.POSIXct(df$DateTime, format = "%Y-%m-%d %H:%M:%S")
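Both numbers mentioned above can be checked by hand. The sketch below is only a sanity check: it parses the two timestamps from the question in UTC (an assumption, chosen to keep daylight-saving changes out of the picture) and converts the reported 8757.884 hours to days.

```r
# The two timestamps from the question, parsed in UTC (assumed time zone)
a <- as.POSIXct("2016-11-04 22:32:49", tz = "UTC")
b <- as.POSIXct("2016-11-05 08:30:56", tz = "UTC")

# Expected gap in hours: 9 h 58 min 7 s
gap_hours <- as.numeric(difftime(b, a, units = "hours"))
print(gap_hours)

# The value reported in the question, expressed in days: close to 365
reported_days <- 8757.884 / 24
print(reported_days)
```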
I have a large file of time-series data, which looks as follows. The dataset covers years, in increments of 15 minutes. A small subset looks like:
uniqueid time
a 2014-04-30 23:30:00
a 2014-04-30 23:45:00
a 2014-05-01 00:00:00
a 2014-05-01 00:15:00
a 2014-05-12 13:45:00
a 2014-05-12 14:00:00
b 2014-05-12 13:45:00
b 2014-05-12 14:00:00
b 2014-05-12 14:30:00
To reproduce above:
time<-c("2014-04-30 23:30:00","2014-04-30 23:45:00","2014-05-01 00:00:00","2014-05-01 00:15:00",
"2014-05-12 13:45:00","2014-05-12 14:00:00","2014-05-12 13:45:00","2014-05-12 14:00:00",
"2014-05-12 14:30:00")
uniqueid<-c("a","a","a","a","a","a","b","b","b")
mydf<-data.frame(uniqueid,time)
My goal is to count the number of rows per unique id, per consecutive timeflow. A consecutive timespan is one in which a unique id is stamped every 15 minutes in a row (such as id a, which is stamped from 30.04.14 23:30 until 01.05.14 00:15, hence 4 rows). When this run of 15-minute steps is disrupted (after 01.05.14 00:15 there is no stamp at 01.05.14 00:30), the next timestamp should count as the start of a new consecutive timeflow, and the rows should again be counted until that flow is disrupted. Time is POSIX.
As you can see in the above example, a consecutive timeflow may cover different days, different months, or different years. I have many unique ids (and, as said, a very large file), so I'm looking for a way that my computer can handle (loops probably wouldn't work).
I am looking for output something like:
uniqueid flow number_rows
a 1 4
a 2 2
b 3 2
b 4 1
I have looked into some time packages (such as lubridate), but given my limited R knowledge, I don't even know where to begin.
I hope all is clear - if not, I'd be happy to try to clarify it further. Thank you very much in advance!
Another way to do this with data.table also using a time difference would be to make use of the data.table internal values for group number and number of rows in each group:
library(data.table)
res<-setDT(mydf)[, list(number_rows=.N,flow=.GRP),
by=.(uniqueid,cumsum(as.numeric(difftime(time,shift(time,1L,type="lag",fill=0),units="mins")) - 15))][,cumsum:=NULL]
print(res)
uniqueid number_rows flow
1: a 4 1
2: a 2 2
3: b 2 3
4: b 1 4
Also since the sample data you posted didn't align with the subset you posted, I have included my data below:
Data
time<-as.POSIXct(c("2014-04-30 23:30:00","2014-04-30 23:45:00","2014-05-01 00:00:00","2014-05-01 00:15:00",
"2014-05-12 13:45:00","2014-05-12 14:00:00","2014-05-12 13:45:00","2014-05-12 14:00:00",
"2014-05-12 14:30:00"))
uniqueid<-c("a","a","a","a","a","a","b","b","b")
mydf<-data.frame(uniqueid,time)
You can group by uniqueid together with the cumulative sum of a flag that marks rows whose time difference from the previous row is not 15 minutes. That gives the flow id, and a count of rows per group then gives what you need:
The logic is that whenever the time difference is not equal to 15 within a uniqueid, a new flow should start, so we label that row TRUE. Taking the cumsum of these labels gives each run of consecutive rows its own flow id:
library(dplyr)
mydf$time <- as.POSIXct(mydf$time, format = "%Y-%m-%d %H:%M:%S")
# convert the time column to POSIXct class so that we can apply the diff function correctly
# (format must be passed by name: the second positional argument of as.POSIXct is tz)
mydf %>% group_by(uniqueid, flow = 1 + cumsum(c(FALSE, diff(time) != 15))) %>%
summarize(num_rows = n())
# Source: local data frame [4 x 3]
# Groups: uniqueid [?]
#
# uniqueid flow num_rows
# <fctr> <dbl> <int>
# 1 a 1 4
# 2 a 2 2
# 3 b 3 2
# 4 b 4 1
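To make the flag-and-cumsum logic concrete, here is a small base R sketch that spells out the intermediate vectors for the six rows of id a. The names gap_min, new_flow and rows_per_flow are illustrative only, and the timestamps are parsed in UTC as an assumption.

```r
# Timestamps for id "a" from the question, parsed in UTC (assumed time zone)
time <- as.POSIXct(c("2014-04-30 23:30:00", "2014-04-30 23:45:00",
                     "2014-05-01 00:00:00", "2014-05-01 00:15:00",
                     "2014-05-12 13:45:00", "2014-05-12 14:00:00"), tz = "UTC")

# Gap to the previous row in minutes (explicit units, so 15 always means minutes)
gap_min <- as.numeric(diff(time), units = "mins")

# TRUE marks a row that starts a new flow; the first row gets FALSE by construction
new_flow <- c(FALSE, gap_min != 15)

# The running count of TRUEs, plus 1, is the flow id
flow <- 1 + cumsum(new_flow)

# Number of rows in each flow
rows_per_flow <- as.vector(table(flow))
print(data.frame(time, new_flow, flow))
print(rows_per_flow)
```

Only the fifth row has a gap other than 15 minutes, so the flow id steps from 1 to 2 there and the counts come out as 4 and 2.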
Base R is pretty fast. Using crude benchmarking, I found it finished in half the time of DT, and I got tired of waiting for dplyr.
# estimated size of data, years x days x hours x 15mins x uniqueids
5*365*24*4*1000 # = approx 180M
# make data with posixct and characters of 180M rows, mydf is approx 2.5GB in memory
time<-rep(as.POSIXct(c("2014-04-30 23:30:00","2014-04-30 23:45:00","2014-05-01 00:00:00","2014-05-01 00:15:00",
"2014-05-12 13:45:00","2014-05-12 14:00:00","2014-05-12 13:45:00","2014-05-12 14:00:00",
"2014-05-12 14:30:00")),times = 20000000)
uniqueid<-rep(as.character(c("a","a","a","a","a","a","b","b","b")),times = 20000000)
mydf<-data.frame(uniqueid,time = time)
rm(time,uniqueid);gc()
Base R:
# assumes that uniqueid's are in groups and in order, and there won't be a followed by b that have the 15 minute "flow"
starttime <- Sys.time()
# find failed flows
mydf$diff <- c(0,diff(mydf$time))
mydf$flowstop <- mydf$diff != 15
# give each flow an id
mydf$flowid <- cumsum(mydf$flowstop)
# clean up vars
mydf$time <- mydf$diff <- mydf$flowstop <- NULL
# find flow length
mydfrle <- rle(mydf$flowid)
# get uniqueid/flowid pairs (unique() is too slow)
mydf <- mydf[!duplicated(mydf$flowid), ]
# append rle and remove separate var
mydf$number_rows <- mydfrle$lengths
rm(mydfrle)
print(Sys.time()-starttime)
# Time difference of 30.39437 secs
data.table:
library(data.table)
starttime <- Sys.time()
res<-setDT(mydf)[, list(number_rows=.N,flow=.GRP),
by=.(uniqueid,cumsum(as.numeric(difftime(time,shift(time,1L,type="lag",fill=0))) - 15))][,cumsum:=NULL]
print(Sys.time()-starttime)
# Time difference of 57.08156 secs
dplyr:
library(dplyr)
# convert the time column to POSIXct class so that we can apply the diff function correctly
starttime <- Sys.time()
mydf %>% group_by(uniqueid, flow = 1 + cumsum(c(F, diff(time) != 15))) %>%
summarize(num_rows = n())
print(Sys.time()-starttime)
# too long, did not finish after a few minutes
I think the assumption of uniqueid's and times being in order is huge, and the other solutions might be able to take advantage of that better. order() is easy enough to do.
I'm not sure about the impact of memory, or of the impact of different data sets that aren't so simple. It should be easy enough to break it into chunks and process if memory is an issue. It takes more code in Base R for sure.
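If the rows are not already ordered, a sort by id and then by time can be done up front. The sketch below uses a few made-up, deliberately shuffled rows; it only shows the ordering step, not the flow counting itself.

```r
# Made-up rows, deliberately out of order
mydf <- data.frame(
  uniqueid = c("b", "a", "b", "a"),
  time = as.POSIXct(c("2014-05-12 14:00:00", "2014-04-30 23:45:00",
                      "2014-05-12 13:45:00", "2014-04-30 23:30:00"), tz = "UTC")
)

# Sort by id first, then by time within each id
mydf <- mydf[order(mydf$uniqueid, mydf$time), ]
rownames(mydf) <- NULL
print(mydf)
```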
With both the "id" and "time" columns ordered, we could build a single group to operate on by creating a logical vector that is TRUE wherever either "id" changes or the gap in "time" exceeds 15 minutes.
With:
id = as.character(mydf$uniqueid)
tm = mydf$time
find where "id" changes:
id_gr = c(TRUE, id[-1] != id[-length(id)])
and where the gap in "time" exceeds 15 minutes:
tm_gr = c(0, difftime(tm[-1], tm[-length(tm)], units = "mins")) > 15
and combine them in:
gr = id_gr | tm_gr
which is TRUE wherever either "id" changed or the "time" gap is > 15 minutes.
And to get the result:
tab = tabulate(cumsum(gr)) ## basically, the only operation per group -- 'n by group'
data.frame(id = id[gr], flow = seq_along(tab), n = tab)
# id flow n
#1 a 1 4
#2 a 2 2
#3 b 3 2
#4 b 4 1
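The counting step can be seen in isolation with a hand-written flag vector. The gr below is typed in by hand to match the nine rows of the question's sample (it is not computed from data here); cumsum turns the flags into group ids and tabulate counts how often each id occurs.

```r
# Hand-written start-of-group flags for the 9 sample rows (a: 4 + 2, b: 2 + 1)
gr <- c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE)

# Each TRUE bumps the running group id by one
group_id <- cumsum(gr)
print(group_id)

# tabulate() counts occurrences of 1, 2, 3, ... in one pass
n_per_group <- tabulate(group_id)
print(n_per_group)
```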
On a larger scale:
set.seed(1821); nid = 1e4
dat = replicate(nid, as.POSIXct("2016-07-07 12:00:00 EEST") +
cumsum(sample(c(1, 5, 10, 15, 20, 30, 45, 60, 90, 120, 150, 200, 250, 300), sample(5e2:1e3, 1), TRUE)*60),
simplify = FALSE)
names(dat) = make.unique(rep_len(letters, nid))
dat = data.frame(id = rep(names(dat), lengths(dat)), time = do.call(c, dat))
system.time({
id = as.character(dat$id); tm = dat$time
id_gr = c(TRUE, id[-1] != id[-length(id)])
tm_gr = c(0, difftime(tm[-1], tm[-length(tm)], unit = "mins")) > 15
gr = id_gr | tm_gr
tab = tabulate(cumsum(gr))
ans1 = data.frame(id = id[gr], flow = seq_along(tab), n = tab)
})
# user system elapsed
# 1.44 0.19 1.66
For comparison, here is MikeyMike's answer:
library(data.table)
dat2 = copy(dat)
system.time({
ans2 = setDT(dat2)[, list(flow = .GRP, n = .N),
by = .(id, cumsum(as.numeric(difftime(time,
shift(time, 1L, type = "lag", fill = 0),
unit = "mins")) > 15))][, cumsum := NULL]
})
# user system elapsed
# 3.95 0.22 4.26
identical(as.data.table(ans1), ans2)
#[1] TRUE
I would like to take a mean of a 7 day rolling window with 1 day increments of data that is collected at 30 minute intervals.
I have tried using data.table with a conditional statement in by, with no success. Any guidance would be greatly appreciated.
# packages
library(data.table)
library(lubridate)
# Set set.seed to have reproducible sampling
set.seed(42)
# Create some Data
start = ymd_hms("2014-01-01 00:00:00")
end = ymd_hms("2014-12-31 23:59:59")
# Create data with 30 minute intervals.
dat <- data.table(timestamp = seq(start, end, by = "30 min"),
sample1 = sample(1:20, 17520, replace = TRUE))
# Create date variable for merging datasets.
dat[, date := as.Date(timestamp)]
# Create data for a 7 day moving window with one day increments.
dat2 <- data.table(start = seq(start, end, by = "1 day"),
end = seq(start + days(7), end + days(7), by = "1 day"))
# Create date variable for merging datasets.
dat2[, date := as.Date(start)]
# merge datasets.
dat <- merge(dat, dat2, by="date")
# Tried
dat[, .(sample.mean = mean(sample1)), by = .(timestamp >= start & timestamp < end)]
# timestamp sample.mean
# 1: TRUE 10.46638
dat[, .(sample.mean = mean(sample1)), by = .(timestamp %in% c(start:end))]
# timestamp sample.mean
# 1: TRUE 10.40059
# 2: FALSE 10.46767
# Warning messages:
# 1: In start:end :
# numerical expression has 17520 elements: only the first used
# 2: In start:end :
# numerical expression has 17520 elements: only the first used
dat[, .(sample.mean = mean(sample1)), by = .(timestamp %between% c(start, end))]
# timestamp sample.mean
# 1: TRUE 19.00000
# 2: FALSE 10.46589
I'm not 100% sure I understand your exact parameters, but here's the basic approach:
setkey(dat, date)
#pull the current day and the 7 previous days
dat[ , dat[.(seq(.BY$date - 7L,
.BY$date, by = "day")),
#nomatch = 0L will exclude any requested dates outside the interval
mean(sample1), nomatch = 0L], by = date]
# date V1
# 1: 2014-01-01 12.31250
# 2: 2014-01-02 10.94792
# 3: 2014-01-03 11.27083
# 4: 2014-01-04 11.10417
# 5: 2014-01-05 10.79167
# ---
# 361: 2014-12-27 10.50260
# 362: 2014-12-28 10.52344
# 363: 2014-12-29 10.05990
# 364: 2014-12-30 10.03906
# 365: 2014-12-31 10.38542
Some possible tinkers:
Change 7L to whatever window you'd like; use positive if you want forward-looking averages
If you want to go by timestamp, you'll have to adjust the 7L to match whatever units (seconds/minutes/hours/etc)
The extreme points of the interval are not technically correct, since the window there is shorter than requested; omit nomatch = 0L and these points will return NA instead
Use .(var = mean(sample1)) to name the output column var.
Here's one approach:
library(zoo)
daymeans = dat[, mean(sample1), by=date][, rmean := rollmean(V1, 7, fill=NA)]
dat[daymeans, rmean := i.rmean, on="date"]
This assumes that your data is already sorted by date; if not, use keyby=date instead of by=date. If you don't want to juggle intermediate objects, there is a one-liner:
# Michael Chirico's suggestion from the comments
dat[dat[, mean(sample1), by=date][, V1 := rollmean(V1, 7, fill=NA)], rmean := i.V1, on = "date"]
You may need to tweak the arguments to rollmean to fit your particular definition of the window. @eddi suggested that runmean from the caTools package is typically faster than zoo's rollmean and so is probably also worth a look.
Crude benchmark with the OP's example data:
dat2 = copy(dat)
# Michael's answer
system.time({
setkey(dat, date)
dat[ , dat[.(seq(.BY$date - 7L,
.BY$date, by = "day")),
mean(sample1), nomatch = 0L], by = date]
})
user system elapsed
0.33 0.00 0.35
# this answer
system.time({
daymeans = dat2[, mean(sample1), by=date][, rmean := rollmean(V1, 7, fill=NA)]
dat2[daymeans, rmean := i.rmean, on="date"]
})
user system elapsed
0 0 0
Why it's faster: Here, we're computing 365 means of 48 numbers and then a rolling mean over a vector of length 365, which is less computationally costly than making 365 merges to find 48*7 numbers and then taking the mean of the latter.
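The two-stage idea (daily means first, then a rolling mean over the 365 daily values) can also be sketched without any packages. Note the swap: this uses base R's tapply and stats::filter instead of data.table and zoo::rollmean. With equal weights and sides = 2, stats::filter gives a centered 7-day mean with NA at both ends, which should correspond to rollmean's default centered alignment with fill = NA. The data is regenerated here in UTC, so the values will not match the outputs shown above.

```r
set.seed(42)

# 30-minute timestamps for 2014, in UTC (assumed time zone)
timestamp <- seq(as.POSIXct("2014-01-01 00:00:00", tz = "UTC"),
                 as.POSIXct("2014-12-31 23:30:00", tz = "UTC"),
                 by = "30 min")
sample1 <- sample(1:20, length(timestamp), replace = TRUE)

# Stage 1: one mean per calendar day (365 means of 48 values each)
day <- as.Date(timestamp)
daily <- as.numeric(tapply(sample1, day, mean))

# Stage 2: centered 7-day moving average over the daily means
rmean <- as.numeric(stats::filter(daily, rep(1 / 7, 7), sides = 2))

print(head(data.frame(date = sort(unique(day)), daily, rmean), 8))
```

The first and last three days have no complete centered window, so their rolling mean is NA.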
I have the following example:
Date1 <- seq(from = as.POSIXct("2010-05-01 02:00"),
to = as.POSIXct("2010-10-10 22:00"), by = 3600)
Dat <- data.frame(DateTime = Date1,
t = rnorm(length(Date1)))
I would like to find the range of values in a given day (i.e. maximum - minimum).
First, I've defined additional columns which define the unique days in terms of the date and in terms of the day of year (doy).
Dat$date <- format(Dat$DateTime, format = "%Y-%m-%d") # find the unique days
Dat$doy <- as.numeric(format(Dat$DateTime, format="%j")) # find the unique days
To then find the range I tried
by(Dat$t, Dat$doy, function(x) range(x))
but this returns the range as two values, not a single value. So my question is: how do I find the calculated range for each day and return the results in a data.frame which has
new_data <- data.frame(date = unique(Dat$date),
range = ...)
Can anyone suggest a method for doing this?
I tend to use tapply for this kind of thing. ave is also useful sometimes. Here:
> dr = tapply(Dat$t,Dat$doy,function(x){diff(range(x))})
Always check tricksy stuff:
> dr[1]
121
3.084317
> diff(range(Dat$t[Dat$doy==121]))
[1] 3.084317
Use the names attribute to get the day-of-year and the values to make a data frame:
> new_data = data.frame(date=names(dr),range=dr)
> head(new_data)
date range
121 121 3.084317
122 122 4.204053
Did you want to convert the number day-of-year back to a date object?
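If so, one possible way is to add the day-of-year to the first day of the year. This sketch assumes all observations fall in 2010, as in the example, and uses hand-typed day-of-year labels in place of names(dr); since day 1 is 1 January, one is subtracted before adding.

```r
# Day-of-year labels as they would come out of names(dr) (character strings)
doy <- c("121", "122")

# Day 1 corresponds to 1 January 2010, hence the "- 1" (year 2010 is assumed)
d <- as.Date(as.integer(doy) - 1, origin = "2010-01-01")
print(d)
```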
# Use the data.table package
require(data.table)
# Set seed so data is reproducible
set.seed(42)
# Create data.table
Date1 <- seq(from = as.POSIXct("2010-05-01 02:00"), to = as.POSIXct("2010-10-10 22:00"), by = 3600)
DT <- data.table(date = as.IDate(Date1), t = rnorm(length(Date1)))
# Set key on data.table so that it is sorted by date
setkey(DT, "date")
# Make a new data.table with the required information (can be used as a data.frame)
new_data <- DT[, diff(range(t)), by = date]
# date V1
# 1: 2010-05-01 4.943101
# 2: 2010-05-02 4.309401
# 3: 2010-05-03 4.568818
# 4: 2010-05-04 2.707036
# 5: 2010-05-05 4.362990
# ---
# 159: 2010-10-06 2.659115
# 160: 2010-10-07 5.820803
# 161: 2010-10-08 4.516654
# 162: 2010-10-09 4.010017
# 163: 2010-10-10 3.311408
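For readers who prefer to stay in base R, a comparable result keyed by calendar date (rather than day-of-year) can be sketched with aggregate. This is a swapped-in alternative, not the data.table approach above; timestamps are generated in UTC as an assumption, so the numbers will differ from the output shown.

```r
set.seed(42)

# Hourly timestamps as in the question, but pinned to UTC (assumed time zone)
Date1 <- seq(from = as.POSIXct("2010-05-01 02:00", tz = "UTC"),
             to = as.POSIXct("2010-10-10 22:00", tz = "UTC"), by = 3600)
Dat <- data.frame(DateTime = Date1, t = rnorm(length(Date1)))

# Calendar date of each observation
Dat$date <- as.Date(Dat$DateTime)

# One row per day: maximum minus minimum of t
new_data <- aggregate(t ~ date, data = Dat, FUN = function(x) diff(range(x)))
names(new_data)[2] <- "range"
print(head(new_data))
```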