I have a data file with two columns: one is a time series, the other is values. Normally the time interval between two rows is exactly 5 minutes, but sometimes it is larger than 5 minutes.
A sample is as below:
library(data.table)
dd <- data.table(date = c("2015-07-01 00:00:00", "2015-07-01 00:05:00",
                          "2015-07-01 00:20:00", "2015-07-01 00:25:00",
                          "2015-07-01 00:30:00"),
                 value = c(9, 1, 10, 12, 0))
What I want to do is check the time interval between two rows; when the interval is larger than 5 minutes, insert new rows with value 0 for the missing timestamps, so the result would be:
date value
2015-07-01 00:00:00 9
2015-07-01 00:05:00 1
2015-07-01 00:10:00 0
2015-07-01 00:15:00 0
2015-07-01 00:20:00 10
2015-07-01 00:25:00 12
2015-07-01 00:30:00 0
Any suggestions and ideas are welcome :)
We can do a join after converting 'date' to a datetime class (POSIXct):
dd[, date := as.POSIXct(date)][]
dd[dd[, .(date=seq(min(date), max(date), by = "5 min"))], on = 'date'
][is.na(value), value := 0][]
# date value
#1: 2015-07-01 00:00:00 9
#2: 2015-07-01 00:05:00 1
#3: 2015-07-01 00:10:00 0
#4: 2015-07-01 00:15:00 0
#5: 2015-07-01 00:20:00 10
#6: 2015-07-01 00:25:00 12
#7: 2015-07-01 00:30:00 0
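The same gap filling can also be done with tidyr::complete(); a minimal sketch (not part of the original answer), assuming dd$date has already been converted to POSIXct as above:
# hedged alternative: expand to the full 5-minute grid and fill missing values with 0
library(tidyr)
filled <- complete(dd,
                   date = seq(min(date), max(date), by = "5 min"),
                   fill = list(value = 0))
filled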
Related
Sometimes the data may not be recorded, so I want to introduce NAs and then check the NA distribution. To keep the problem simple, I will generate a date-and-hour sequence for each year and join against it. How can I generate the dates and hours from the year?
For example, for 2017 and 2018 there should be 17520 timestamps (8760 for each year):
1 2016-01-01 01:00:00
2 2016-01-01 02:00:00
3 2016-01-01 03:00:00
4 2016-01-01 04:00:00
5 2016-01-01 05:00:00
6 2016-01-01 06:00:00
7 2016-01-01 07:00:00
8 2016-01-01 08:00:00
9 2016-01-01 09:00:00
10 2016-01-01 10:00:00
Try
seq(from = as.POSIXct("2017-01-01 01:00", tz = "UTC"),
    to = as.POSIXct("2018-12-31 23:00", tz = "UTC"),
    by = "hour")

length(seq(from = as.POSIXct("2017-01-01 01:00", tz = "UTC"),
           to = as.POSIXct("2018-12-31 23:00", tz = "UTC"),
           by = "hour"))
[1] 17519
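The last stamp of 2018 is 23:00, so this sequence has 17519 elements rather than the 17520 (2 x 8760) mentioned in the question. If each year should contribute exactly 8760 hourly stamps, a small adjustment of the end point does it; a minimal sketch:
# hedged sketch: ending at 2019-01-01 00:00 gives 8760 stamps per year, 17520 in total
length(seq(from = as.POSIXct("2017-01-01 01:00", tz = "UTC"),
           to = as.POSIXct("2019-01-01 00:00", tz = "UTC"),
           by = "hour"))
#[1] 17520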
So I am currently struggling to change the year part of a date using an if statement.
My date column looks like this (there are many more values in the real dataframe, so it cannot be a manual solution):
Date
<S3: POSIXct>
2020-10-31
2020-12-31
I essentially want to create an if command (or any other method) such that if the month is less than 12 (i.e. not December), then the year is increased by 1.
So far I have tried using the lubridate package to do:
if (month(df$Date) < 12) {
  df$Date <- df$Date %m+% years(1)
}
However, the condition does not seem to be applied correctly; instead, one is added to every year, including those where the month is 12.
Any help would be greatly appreciated!
In R, if() is not vectorised: it only considers the first element of the condition, so the whole column gets updated at once. You could approach it with a vectorised condition like this:
library(lubridate)

dates <- seq.Date(from = as.Date('2015-01-01'),
                  to = Sys.Date(),
                  by = 'months')

indicator <- month(dates) < 12
dates2 <- dates
dates2[indicator] <- dates[indicator] %m+% years(1)

res <- data.frame(dates, indicator, dates2)
> head(res, 25)
dates indicator dates2
1 2015-01-01 TRUE 2016-01-01
2 2015-02-01 TRUE 2016-02-01
3 2015-03-01 TRUE 2016-03-01
4 2015-04-01 TRUE 2016-04-01
5 2015-05-01 TRUE 2016-05-01
6 2015-06-01 TRUE 2016-06-01
7 2015-07-01 TRUE 2016-07-01
8 2015-08-01 TRUE 2016-08-01
9 2015-09-01 TRUE 2016-09-01
10 2015-10-01 TRUE 2016-10-01
11 2015-11-01 TRUE 2016-11-01
12 2015-12-01 FALSE 2015-12-01
13 2016-01-01 TRUE 2017-01-01
14 2016-02-01 TRUE 2017-02-01
15 2016-03-01 TRUE 2017-03-01
16 2016-04-01 TRUE 2017-04-01
17 2016-05-01 TRUE 2017-05-01
18 2016-06-01 TRUE 2017-06-01
19 2016-07-01 TRUE 2017-07-01
20 2016-08-01 TRUE 2017-08-01
21 2016-09-01 TRUE 2017-09-01
22 2016-10-01 TRUE 2017-10-01
23 2016-11-01 TRUE 2017-11-01
24 2016-12-01 FALSE 2016-12-01
25 2017-01-01 TRUE 2018-01-01
Found a more direct solution which doesn't use if statements:
df$Date[month(df$Date)<12] <- df$Date[month(df$Date)<12] %m+% years(1)
This still uses the month() function from the lubridate package for the condition.
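For completeness, the same vectorised update can be written with dplyr::if_else(), which, unlike base ifelse(), keeps the date class; a hedged sketch, not from the original answers:
# hedged sketch: if_else() is vectorised and preserves the POSIXct class of Date
library(dplyr)
library(lubridate)
df$Date <- if_else(month(df$Date) < 12, df$Date %m+% years(1), df$Date)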
I've got a dataset with the following shape
ID Start Time End Time
1 01/01/2017 00:15:00 01/01/2017 07:15:00
2 01/01/2017 04:45:00 01/01/2017 06:15:00
3 01/01/2017 10:20:00 01/01/2017 20:15:00
4 01/01/2017 02:15:00 01/01/2017 00:15:00
5 02/01/2017 15:15:00 03/01/2017 00:30:00
6 03/01/2017 07:00:00 04/01/2017 09:15:00
I would like to count, every 15 minutes for an entire year, how many items have started but not finished, i.e. count the rows with a start time less than or equal to the time I'm looking at and an end time after that time.
I'm looking for an approach using tidyverse/dplyr if possible.
Any help or guidance would be very much appreciated.
If I understand correctly, the OP wants to count the number of simultaneously active events.
One possibility to tackle this question is the coverage() function from Bioconductor's IRanges package. Another one is to aggregate in a non-equi join, which is available in the data.table package.
Non-equi join
# create sequence of datetimes (limited to 4 days for demonstration)
seq15 <- seq(lubridate::as_datetime("2017-01-01"),
             lubridate::as_datetime("2017-01-05"), by = "15 mins")

# aggregate within a non-equi join
library(data.table)
result <- periods[.(time = seq15), on = .(Start.Time <= time, End.Time > time),
                  .(time, count = sum(!is.na(ID))), by = .EACHI][, .(time, count)]
result
time count
1: 2017-01-01 00:00:00 0
2: 2017-01-01 00:15:00 1
3: 2017-01-01 00:30:00 1
4: 2017-01-01 00:45:00 1
5: 2017-01-01 01:00:00 1
---
381: 2017-01-04 23:00:00 0
382: 2017-01-04 23:15:00 0
383: 2017-01-04 23:30:00 0
384: 2017-01-04 23:45:00 0
385: 2017-01-05 00:00:00 0
The result can be visualized graphically:
library(ggplot2)
ggplot(result) + aes(time, count) + geom_step()
Data
periods <- readr::read_table(
"ID Start.Time End.Time
1 01/01/2017 00:15:00 01/01/2017 07:15:00
2 01/01/2017 04:45:00 01/01/2017 06:15:00
3 01/01/2017 10:20:00 01/01/2017 20:15:00
4 01/01/2017 02:15:00 01/01/2017 00:15:00
5 02/01/2017 15:15:00 03/01/2017 00:30:00
6 03/01/2017 07:00:00 04/01/2017 09:15:00"
)
# convert the datetime strings to class POSIXct
library(data.table)
cols <- names(periods)[names(periods) %like% "Time$"]
setDT(periods)[, (cols) := lapply(.SD, lubridate::dmy_hms), .SDcols = cols]
periods
ID Start.Time End.Time
1: 1 2017-01-01 00:15:00 2017-01-01 07:15:00
2: 2 2017-01-01 04:45:00 2017-01-01 06:15:00
3: 3 2017-01-01 10:20:00 2017-01-01 20:15:00
4: 4 2017-01-01 02:15:00 2017-01-01 00:15:00
5: 5 2017-01-02 15:15:00 2017-01-03 00:30:00
6: 6 2017-01-03 07:00:00 2017-01-04 09:15:00
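Since the OP asked for a tidyverse/dplyr approach, here is a minimal dplyr sketch of the same counting logic (my own addition, assuming the periods table and the seq15 sequence defined above); it is simpler to read but slower than the non-equi join:
# hedged sketch: for each 15-minute stamp, count events that have started but not yet ended
library(dplyr)
result_tv <- tibble(time = seq15) %>%
  rowwise() %>%
  mutate(count = sum(periods$Start.Time <= time & periods$End.Time > time)) %>%
  ungroup()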
I have a dataframe where I split the datetime column into date and time (two columns). However, when I group by time it gives me duplicates in time. To investigate, I used table() on the time column, and it also showed duplicates. This is a sample of it:
> table(df$time)
00:00:00 00:00:00 00:15:00 00:15:00 00:30:00 00:30:00
2211 1047 2211 1047 2211 1047
As you may see, after the split some of the "unique" values kept an extra " " inside. Is there an easy way to solve this?
PS: The datatype of the time column is character.
EDIT: Code added
df$datetime <- as.character.Date(df$datetime)
x <- colsplit(df$datetime, ' ', names = c('Date','Time'))
df <- cbind(df, x)
There are a number of approaches. One of them is to use appropriate functions to extract the date and the time from the datetime column:
df <- data.frame(datetime = seq(from = as.POSIXct("2018-5-15 0:00", tz = "UTC"),
                                to = as.POSIXct("2018-5-16 24:00", tz = "UTC"),
                                by = "30 min"))
head(df$datetime)
#[1] "2018-05-15 00:00:00 UTC" "2018-05-15 00:30:00 UTC" "2018-05-15 01:00:00 UTC" "2018-05-15 01:30:00 UTC"
#[5] "2018-05-15 02:00:00 UTC" "2018-05-15 02:30:00 UTC"
df$Date <- as.Date(df$datetime)
df$Time <- format(df$datetime,"%H:%M:%S")
head(df)
# datetime Date Time
# 1 2018-05-15 00:00:00 2018-05-15 00:00:00
# 2 2018-05-15 00:30:00 2018-05-15 00:30:00
# 3 2018-05-15 01:00:00 2018-05-15 01:00:00
# 4 2018-05-15 01:30:00 2018-05-15 01:30:00
# 5 2018-05-15 02:00:00 2018-05-15 02:00:00
# 6 2018-05-15 02:30:00 2018-05-15 02:30:00
table(df$Time)
#00:00:00 00:30:00 01:00:00 01:30:00 02:00:00 02:30:00 03:00:00 03:30:00 04:00:00 04:30:00 05:00:00 05:30:00
#3 2 2 2 2 2 2 2 2 2 2 2
#06:00:00 06:30:00 07:00:00 07:30:00 08:00:00 08:30:00 09:00:00 09:30:00 10:00:00 10:30:00 11:00:00 11:30:00
#2 2 2 2 2 2 2 2 2 2 2 2
#12:00:00 12:30:00 13:00:00 13:30:00 14:00:00 14:30:00 15:00:00 15:30:00 16:00:00 16:30:00 17:00:00 17:30:00
#2 2 2 2 2 2 2 2 2 2 2 2
#18:00:00 18:30:00 19:00:00 19:30:00 20:00:00 20:30:00 21:00:00 21:30:00 22:00:00 22:30:00 23:00:00 23:30:00
#2 2 2 2 2 2 2 2 2 2 2 2
# If the data were given as character strings and contained extra spaces, the above approach would still work
df <- data.frame(datetime=c("2018-05-15 00:00:00","2018-05-15 00:30:00",
"2018-05-15 01:00:00", "2018-05-15 02:00:00",
"2018-05-15 00:00:00","2018-05-15 00:30:00"),
stringsAsFactors=FALSE)
df$Date <- as.Date(df$datetime)
df$Time <- format(as.POSIXct(df$datetime, tz="UTC"),"%H:%M:%S")
head(df)
# datetime Date Time
# 1 2018-05-15 00:00:00 2018-05-15 00:00:00
# 2 2018-05-15 00:30:00 2018-05-15 00:30:00
# 3 2018-05-15 01:00:00 2018-05-15 01:00:00
# 4 2018-05-15 02:00:00 2018-05-15 02:00:00
# 5 2018-05-15 00:00:00 2018-05-15 00:00:00
# 6 2018-05-15 00:30:00 2018-05-15 00:30:00
table(df$Time)
#00:00:00 00:30:00 01:00:00 02:00:00
# 2 2 1 1
reshape2::colsplit accepts regular expressions, so you could split on '\s+' which matches 1 or more whitespace characters.
You can find out more about regular expressions in R using ?base::regex. The syntax is largely consistent across languages, so you can use pretty much any regex tutorial. Take a look at https://regex101.com/. This site evaluates your regular expressions in real time and shows you exactly what each part matches. It is extremely helpful!
Keep in mind that in R string literals, as compared to most other languages, you must double the backslashes. So \s (to match one whitespace character) must be written as \\s in R.
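Applied to the code in the question, the split could look like this (a minimal sketch, reusing the column names from the question):
# hedged sketch: split on one or more whitespace characters instead of a single literal space
library(reshape2)
x <- colsplit(df$datetime, '\\s+', names = c('Date', 'Time'))
df <- cbind(df, x)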
I have an R data.frame containing one value for every quarter of an hour:
Date A B
1 2015-11-02 00:00:00 0 0 //day start
2 2015-11-02 00:15:00 0 0
3 2015-11-02 00:30:00 0 0
4 2015-11-02 00:45:00 0 0
...
96 2015-11-02 23:45:00 0 0 //day end
97 2015-11-03 00:00:00 0 0 //new day
...
6 2016-03-23 01:15:00 0 0 //last record
I use xts to construct a time series
xtsA <- xts(data$A,data$Date)
by using apply.daily I get the result I expect
apply.daily(xtsA, sum)
Date A
1 2015-11-02 23:45:00 400
2 2015-11-03 23:45:00 400
3 2015-11-04 23:45:00 500
but apply.weekly seems to use Monday as the last day of the week
Date A
19 2016-03-07 00:45:00 6500 //Monday
20 2016-03-14 00:45:00 5500 //Monday
21 2016-03-21 00:45:00 5000 //Monday
and I do not understand why it uses 00:45:00. Does anyone know?
Data is imported from a CSV file; the Date column looks like this:
data <- read.csv("...", header=TRUE)
Date A
1 151102 0000 0
...
The error is in the date-time interpretation; using
data$Date <- as.POSIXct(strptime(data$Date, "%y%m%d %H%M"), tz = "GMT")
solves it, and now apply.weekly returns
Date A
1 2015-11-08 23:45:00 3500 //Sunday
2 2015-11-15 23:45:00 4000 //Sunday
...
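Putting the pieces together, a hedged sketch assuming the Date and A columns from the question:
# hedged sketch: parse the compact date format, rebuild the xts object and aggregate by week
library(xts)
data$Date <- as.POSIXct(strptime(data$Date, "%y%m%d %H%M"), tz = "GMT")
xtsA <- xts(data$A, order.by = data$Date)
apply.weekly(xtsA, sum)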