Dataframe datetime value row filling - r

I have a CSV file that contains the following:
ts1<-read.table(header = TRUE, sep=",", text="
start, end, value
1,26/11/2014 13:00,26/11/2014 20:00,decreasing
2,26/11/2014 20:00,27/11/2014 09:00,increasing ")
I would like to transform the above data frame into one in which each interval is expanded into hourly rows, each filled in with the value. The rows run from the start time to the end time minus one hour, as follows:
date hour value
1 26/11/2014 13:00 decreasing
2 26/11/2014 14:00 decreasing
3 26/11/2014 15:00 decreasing
4 26/11/2014 16:00 decreasing
5 26/11/2014 17:00 decreasing
6 26/11/2014 18:00 decreasing
7 26/11/2014 19:00 decreasing
8 26/11/2014 20:00 increasing
9 26/11/2014 21:00 increasing
10 26/11/2014 22:00 increasing
11 26/11/2014 23:00 increasing
12 27/11/2014 00:00 increasing
13 27/11/2014 01:00 increasing
14 27/11/2014 02:00 increasing
15 27/11/2014 03:00 increasing
16 27/11/2014 04:00 increasing
17 27/11/2014 05:00 increasing
18 27/11/2014 06:00 increasing
19 27/11/2014 07:00 increasing
20 27/11/2014 08:00 increasing
I tried to start with separating the hours from the dates:
> t <- strftime(ts1$end, format="%H:%M:%S")
> t
[1] "00:00:00" "00:00:00"

We can use data.table. Convert the 'data.frame' to 'data.table' (setDT(ts1)) and group by the row sequence (1:nrow(ts1)). Within each group, convert the 'start' and 'end' columns to datetime class (using dmy_hm from lubridate), build the sequence by '1 hour', drop its last element (head(..., -1)) so that the end time is excluded, format the result as expected, split it on whitespace (tstrsplit), and concatenate with the 'value' column. Then remove the 'rn' column by assigning it to NULL. Finally, we can change the column names (if needed).
library(lubridate)
library(data.table)
res <- setDT(ts1)[,{st <- dmy_hm(start)
et <- dmy_hm(end)
c(tstrsplit(format(head(seq(st, et, by = "1 hour"),-1),
"%d/%m/%Y %H:%M"), "\\s+"), as.character(value))} ,
by = .(rn=1:nrow(ts1))
][, rn := NULL][]
setnames(res, c("date", "hour", "value"))[]
# date hour value
# 1: 26/11/2014 13:00 decreasing
# 2: 26/11/2014 14:00 decreasing
# 3: 26/11/2014 15:00 decreasing
# 4: 26/11/2014 16:00 decreasing
# 5: 26/11/2014 17:00 decreasing
# 6: 26/11/2014 18:00 decreasing
# 7: 26/11/2014 19:00 decreasing
# 8: 26/11/2014 20:00 increasing
# 9: 26/11/2014 21:00 increasing
#10: 26/11/2014 22:00 increasing
#11: 26/11/2014 23:00 increasing
#12: 27/11/2014 00:00 increasing
#13: 27/11/2014 01:00 increasing
#14: 27/11/2014 02:00 increasing
#15: 27/11/2014 03:00 increasing
#16: 27/11/2014 04:00 increasing
#17: 27/11/2014 05:00 increasing
#18: 27/11/2014 06:00 increasing
#19: 27/11/2014 07:00 increasing
#20: 27/11/2014 08:00 increasing

Here is a solution using lubridate and plyr. It processes each row of the data to make a sequence from the start to the end, and returns this with the value. Results from each row are combined into one data.frame. If you need to process the results further, you might be better off not separating the datetime into date and time.
library(plyr)
library(lubridate)
ts1$start <- dmy_hm(ts1$start)
ts1$end <- dmy_hm(ts1$end)
adply(.data = ts1, .margin = 1, .fun = function(x){
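# hourly sequence; head(..., -1) drops the end time, which belongs to the next interval
datetime <- head(seq(x$start, x$end, by = "hour"), -1)
# data.frame(datetime, value = x$value)  # alternative: keep a single datetime column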
data.frame(date = as.Date(datetime), time = format(datetime, "%H:%M"), value = x$value)
})[, -(1:2)]
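For comparison, here is a minimal base-R sketch of the same expansion with no packages. It assumes the input is rebuilt as a plain data.frame with character start and end columns (the ts1 below is a hand-typed copy of the question's two rows, not the result of the read.table call), and it fixes the time zone to UTC so that daylight-saving changes cannot add or drop an hour.

```r
# Hand-typed copy of the question's two intervals (an assumption, for illustration)
ts1 <- data.frame(
  start = c("26/11/2014 13:00", "26/11/2014 20:00"),
  end   = c("26/11/2014 20:00", "27/11/2014 09:00"),
  value = c("decreasing", "increasing"),
  stringsAsFactors = FALSE
)

fmt <- "%d/%m/%Y %H:%M"
st <- as.POSIXct(ts1$start, format = fmt, tz = "UTC")
et <- as.POSIXct(ts1$end,   format = fmt, tz = "UTC")

# One small data.frame per input row: hourly steps from start up to end minus one hour
pieces <- lapply(seq_len(nrow(ts1)), function(i) {
  hours <- head(seq(st[i], et[i], by = "hour"), -1)
  data.frame(date  = format(hours, "%d/%m/%Y"),
             hour  = format(hours, "%H:%M"),
             value = ts1$value[i],
             stringsAsFactors = FALSE)
})

res <- do.call(rbind, pieces)
rownames(res) <- NULL
res
```

The head(..., -1) call is what keeps 20:00 from appearing twice, once as the end of the first interval and once as the start of the second.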

Related

group a column by date with different formats

I have a dataset where one column has date and time values. Every date has multiple entries. The first row for every date has a date value in the form 29MAY2018_00:00:00.000000, while the rest of the rows for the same date have time values, i.e. 20:00 - 21:00. The problem is that I want to sum the values in another column for each day.
The sample data has the following format
Date A
29MAY2018_00:00:00.000000
20:00 - 21:00 0.009
21:00 - 22:00 0.003
22:00 - 23:00 0.0003
23:00 - 00:00 0
30MAY2018_00:00:00.000000
00:00 - 01:00 -0.0016
01:00 - 02:00 -0.0012
02:00 - 03:00 -0.0002
03:00 - 04:00 -0.0023
04:00 - 05:00 0
05:00 - 06:00 -0.0005
20:00 - 21:00 -0.0042
21:00 - 22:00 -0.0035
22:00 - 23:00 -0.0026
23:00 - 00:00 -0.001
I have created a new column
data$C[data$A ==0 ] <- 0
data$C[data$A < 0 ] <- -1
data$C[data$A > 0 ] <- 1
I need to sum the column 'C' for every date.
The output should be
A B
29-MAY-2019 4
30-MAY-2019 -9
31-MAY-2019 3
An option would be to create a grouping column based on the occurrence of full datetime format in the 'Date', summarise the first 'Date', convert it to Date format (with anydate from anytime) and get the sum of sign of 'A'
library(tidyverse)
library(anytime)
data %>%
group_by(grp = cumsum(str_detect(Date, "[A-Z]"))) %>%
summarise(Date = anydate(first(Date)),
B = sum(sign(A), na.rm = TRUE))
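The grouping trick can also be tried without any packages. The sketch below is only an illustration under assumptions: the data frame is a hand-typed excerpt of the sample (with NA in 'A' on the date rows), and the date is kept as the raw 29MAY2018-style label rather than converted, since parsing month abbreviations depends on the locale.

```r
# Hand-typed excerpt of the sample data (assumption: date rows carry NA in A)
data <- data.frame(
  Date = c("29MAY2018_00:00:00.000000", "20:00 - 21:00", "21:00 - 22:00",
           "22:00 - 23:00", "23:00 - 00:00",
           "30MAY2018_00:00:00.000000", "00:00 - 01:00", "01:00 - 02:00"),
  A = c(NA, 0.009, 0.003, 0.0003, 0, NA, -0.0016, -0.0012),
  stringsAsFactors = FALSE
)

# A new group starts at every row whose Date contains a capital letter (the month name)
grp <- cumsum(grepl("[A-Z]", data$Date))

res <- data.frame(
  Date = substr(data$Date[!duplicated(grp)], 1, 9),   # first row of each group
  B    = as.vector(tapply(sign(data$A), grp, sum, na.rm = TRUE)),
  stringsAsFactors = FALSE
)
res
```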

Transforming data into xts format

I have some data, and the Date column includes the time too. I am trying to get this data into xts format. I have tried below, but I get an error. Can anyone see anything wrong with this code? TIA
Date Open High Low Close
1 2017.01.30 07:00 1.25735 1.25761 1.25680 1.25698
2 2017.01.30 08:00 1.25697 1.25702 1.25615 1.25619
3 2017.01.30 09:00 1.25618 1.25669 1.25512 1.25533
4 2017.01.30 10:00 1.25536 1.25571 1.25093 1.25105
5 2017.01.30 11:00 1.25104 1.25301 1.25093 1.25262
6 2017.01.30 12:00 1.25260 1.25479 1.25229 1.25361
7 2017.01.30 13:00 1.25362 1.25417 1.25096 1.25177
8 2017.01.30 14:00 1.25177 1.25219 1.24900 1.25071
9 2017.01.30 15:00 1.25070 1.25307 1.24991 1.25238
10 2017.01.30 16:00 1.25238 1.25358 1.25075 1.25159
df = read.table(file = "GBPUSD60.csv", sep="," , header = TRUE)
dates = as.character(df$Date)
df$Date = NULL
Sept17 = xts(df, as.POSIXct(dates, format="%Y-%m-%d %H:%M"))
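One thing that stands out in the code above is that the dates in the data are separated by dots (2017.01.30) while the format string uses hyphens ("%Y-%m-%d"). A quick base-R check, using two of the date strings from the sample, shows that this mismatch makes as.POSIXct return NA, which xts would then most likely refuse as an index. The tz = "UTC" argument is an assumption added here to keep the example reproducible.

```r
dates <- c("2017.01.30 07:00", "2017.01.30 08:00")

# Format from the question: the hyphens do not match the dots in the data
bad <- as.POSIXct(dates, format = "%Y-%m-%d %H:%M", tz = "UTC")

# Format matching the data as shown
good <- as.POSIXct(dates, format = "%Y.%m.%d %H:%M", tz = "UTC")

bad
good
```

If that is indeed the cause, passing the second format to as.POSIXct inside the xts() call should be enough.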

Count time stamps in different time intervals - issue with interval which spans midnight

I have a dataframe ("observations") with time stamps in H:M format ("Time"). In a second dataframe ("intervals"), I have time ranges defined by "From" and "Till" variables, also in H:M format.
I want to count the number of observations that fall within each interval. I have been using between from data.table, which has worked without any problem when dates are included.
However, now I only have time stamps, without dates. This causes problems for the times that occur in the interval that spans midnight (20:00 - 05:59). These times are not counted by the code I have tried.
Example below
interval.data <- data.frame(From = c("14:00", "20:00", "06:00"), Till = c("19:59", "05:59", "13:59"), stringsAsFactors = F)
observations <- data.frame(Time = c("14:32", "15:59", "16:32", "21:34", "03:32", "02:00", "00:00", "05:57", "19:32", "01:32", "02:22", "06:00", "07:50"), stringsAsFactors = F)
interval.data
# From Till
# 1: 14:00:00 19:59:00
# 2: 20:00:00 05:59:00 # <- interval including midnight
# 3: 06:00:00 13:59:00
observations
# Time
# 1: 14:32:00
# 2: 15:59:00
# 3: 16:32:00
# 4: 21:34:00 # Row 4-8 & 10-11 falls in 'midnight interval', but are not counted
# 5: 03:32:00 #
# 6: 02:00:00 #
# 7: 00:00:00 #
# 8: 05:57:00 #
# 9: 19:32:00
# 10: 01:32:00 #
# 11: 02:22:00 #
# 12: 06:00:00
# 13: 07:50:00
library(data.table)
library(plyr)
adply(interval.data, 1, function(x, y) sum(y[, 1] %between% c(x[1], x[2])), y = observations)
# From Till V1
# 1 14:00 19:59 4
# 2 20:00 05:59 0 # <- zero counts - wrong!
# 3 06:00 13:59 2
One approach is to use a non-equi join in data.table, along with its helper function as.ITime for working with time strings.
You'll have an issue with the interval that spans midnight, but, there should only ever be one of those. And as you're interested in the number of observations per 'group' of intervals, you can treat this group as the equivalent of the 'Not' of the others.
For example, first convert your data.frame to data.table
library(data.table)
## set your data.frames as `data.table`
setDT(interval.data)
setDT(observations)
Then use as.ITime to convert to an integer representation of time
## convert time stamps
interval.data[, `:=`(FromMins = as.ITime(From),
TillMins = as.ITime(Till))]
observations[, TimeMins := as.ITime(Time)]
## you could combine this step with the non-equi join directly, but I'm separating it for clarity
You can now use a non-equi join to find the interval that each time falls within. Note that the times that return 'NA' are actually those that fall inside the midnight-spanning interval.
interval.data[
observations
, on = .(FromMins <= TimeMins, TillMins > TimeMins)
]
# From Till FromMins TillMins Time
# 1: 14:00 19:59 872 872 14:32
# 2: 14:00 19:59 959 959 15:59
# 3: 14:00 19:59 992 992 16:32
# 4: NA NA 1294 1294 21:34
# 5: NA NA 212 212 03:32
# 6: NA NA 120 120 02:00
# 7: NA NA 0 0 00:00
# 8: NA NA 357 357 05:57
# 9: 14:00 19:59 1172 1172 19:32
# 10: NA NA 92 92 01:32
# 11: NA NA 142 142 02:22
# 12: 06:00 13:59 360 360 06:00
# 13: 06:00 13:59 470 470 07:50
Then, to get the number of observations for each interval, you just take .N grouped by the interval (From and Till), which can be chained onto the end of the above statement
interval.data[
observations
, on = .(FromMins <= TimeMins, TillMins > TimeMins)
][
, .N
, by = .(From, Till)
]
# From Till N
# 1: 14:00 19:59 4
# 2: NA NA 7
# 3: 06:00 13:59 2
Where the NA group corresponds to the one that spans midnight
I just tweaked your code to get the desired result. Hope this helps!
adply(interval.data, 1, function(x, y)
# an interval that wraps past midnight is split into [From, "23:59"] and ["00:00", Till]
if (x$From > x$Till) sum(y$Time %between% c(x$From, "23:59"), y$Time %between% c("00:00", x$Till)) else sum(y$Time %between% c(x$From, x$Till)), y = observations)
Output is:
From Till V1
1 14:00 19:59 4
2 20:00 05:59 7
3 06:00 13:59 2
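The wrap-around logic can also be written in base R by turning each H:M string into minutes since midnight. This is just a sketch; to_min and count_in are helper names made up for this example. An interval whose From is later than its Till is treated as spanning midnight, so a time belongs to it if it is after From or before Till.

```r
interval.data <- data.frame(From = c("14:00", "20:00", "06:00"),
                            Till = c("19:59", "05:59", "13:59"),
                            stringsAsFactors = FALSE)
observations <- data.frame(Time = c("14:32", "15:59", "16:32", "21:34", "03:32",
                                    "02:00", "00:00", "05:57", "19:32", "01:32",
                                    "02:22", "06:00", "07:50"),
                           stringsAsFactors = FALSE)

# Hypothetical helper: "HH:MM" -> minutes since midnight
to_min <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) as.integer(p[1]) * 60L + as.integer(p[2]), integer(1))
}

# Hypothetical helper: count times inside one interval (both ends inclusive)
count_in <- function(from, till, t) {
  if (from <= till) sum(t >= from & t <= till)   # ordinary interval
  else              sum(t >= from | t <= till)   # interval spanning midnight
}

obs_min <- to_min(observations$Time)
interval.data$N <- mapply(count_in,
                          to_min(interval.data$From),
                          to_min(interval.data$Till),
                          MoreArgs = list(t = obs_min))
interval.data
```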

How to find runs of a given length in a series of data?

I'm trying to study times in which flow was operating at a given level. I would like to find when flows were above a given level for 4 or more hours. How would I go about doing this?
Sample code:
Date<-format(seq(as.POSIXct("2014-01-01 01:00"), as.POSIXct("2015-01-01 00:00"), by="hour"), "%Y-%m-%d %H:%M", usetz = FALSE)
Flow<-runif(8760, 0, 2300)
IsHigh<- function(x ){
if (x < 1600) return(0)
if (1600 <= x) return(1)
}
isHighFlow = unlist(lapply(Flow, IsHigh))
df = data.frame(Date, Flow, isHighFlow )
I was asked to edit my question to supply what I would like to see as output.
I would like to see a data frame such as the one below. The only issue is that hoursHighFlow is incorrect. I'm not sure how to fix the code to generate the correct hoursHighFlow.
temp <- df %>%
mutate(highFlowInterval = cumsum(isHighFlow==1)) %>%
group_by(highFlowInterval) %>%
summarise(hoursHighFlow = n(), minDate = min(as.character(Date)), maxDate = max(as.character(Date)))
#Then join the two tables together.
temp2<-sqldf("SELECT *
FROM temp LEFT JOIN df
ON df.Date BETWEEN temp.minDate AND temp.maxDate")
I am able to use subset to select the length of time running at a high flow rate.
t<-subset(temp2,isHighFlow==1)
t<-subset(t, hoursHighFlow>=4)
Put it in a data.table:
require(data.table)
DT <- data.table(df)
Mark runs and lengths:
DT[,`:=`(r=.GRP,rlen=.N),by={r <- rle(isHighFlow);rep(1:length(r[[1]]),r$lengths)}]
Subset to high-flow runs of 4 or more hours:
DT[isHighFlow == 1 & rlen >= 4L]
How it works:
New columns are created in the second argument of DT[i,j,by] with :=.
.GRP and .N are special variables for, respectively, the index and size of the by group.
A data.table can be subset simply with DT[i], unlike a data.frame.
Apart from subsetting, most of what works with a data.frame works the same on a data.table.
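To see what the run marking does without data.table, here is a small base-R sketch on a made-up 0/1 vector (the values are invented for illustration, not taken from the question's random data). rle gives the length of each run, and repeating those lengths back over the rows lets you filter on both the flag and the run length.

```r
# Made-up flags: runs of 1s with lengths 4, 2 and 5
isHighFlow <- c(0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1)

r <- rle(isHighFlow)
run_id  <- rep(seq_along(r$lengths), r$lengths)  # which run each row belongs to
run_len <- rep(r$lengths, r$lengths)             # length of that run

# Rows that sit in a high-flow run of 4 or more consecutive hours
keep <- isHighFlow == 1 & run_len >= 4
which(keep)
unique(run_id[keep])
```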
Here is a solution using the dplyr package:
df %>%
mutate(interval = cumsum(isHighFlow!=lag(isHighFlow, default = 0))) %>%
group_by(interval) %>%
summarise(hoursHighFlow = n(), minDate = min(as.character(Date)), maxDate = max(as.character(Date)), isHighFlow = mean(isHighFlow)) %>%
filter(hoursHighFlow >= 4, isHighFlow == 1)
Result:
interval hoursHighFlow minDate maxDate isHighFlow
1 25 4 2014-01-03 07:00 2014-01-03 10:00 1
2 117 4 2014-01-12 01:00 2014-01-12 04:00 1
3 245 6 2014-01-23 13:00 2014-01-23 18:00 1
4 401 6 2014-02-07 03:00 2014-02-07 08:00 1
5 437 5 2014-02-11 02:00 2014-02-11 06:00 1
6 441 4 2014-02-11 21:00 2014-02-12 00:00 1
7 459 4 2014-02-13 09:00 2014-02-13 12:00 1
8 487 4 2014-02-16 03:00 2014-02-16 06:00 1
9 539 7 2014-02-21 08:00 2014-02-21 14:00 1
10 567 4 2014-02-24 11:00 2014-02-24 14:00 1
.. ... ... ... ... ...
As Frank notes, you could achieve the same result by using rle to set the intervals, replacing the mutate line with:
mutate(interval = rep(1:length(rle(df$isHighFlow)[[2]]),rle(df$isHighFlow)[[1]])) %>%

How to get week numbers from dates?

Looking for a function in R to convert dates into week numbers (of year) I went for week from package data.table.
However, I observed some strange behaviour:
> week("2014-03-16") # Sun, expecting 11
[1] 11
> week("2014-03-17") # Mon, expecting 12
[1] 11
> week("2014-03-18") # Tue, expecting 12
[1] 12
Why is the week number switching to 12 on Tuesday, instead of Monday? What am I missing? (Timezone should be irrelevant as there are just dates?!)
Other suggestions for (base) R functions are appreciated as well.
Base package: use the function strftime with the format %V to obtain the week of the year as a decimal number (01–53) as defined in ISO 8601. (More details in the documentation: ?strftime)
strftime(c("2014-03-16", "2014-03-17","2014-03-18", "2014-01-01"), format = "%V")
Output:
[1] "11" "12" "12" "01"
If you try with lubridate:
library(lubridate)
lubridate::week(ymd("2014-03-16", "2014-03-17","2014-03-18", '2014-01-01'))
[1] 11 11 12 1
The pattern is the same. Try isoweek:
lubridate::isoweek(ymd("2014-03-16", "2014-03-17","2014-03-18", '2014-01-01'))
[1] 11 12 12 1
I understand the need for packages in certain situations, but the base language is so elegant and so proven (and debugged and optimized).
Why not:
dt <- as.Date("2014-03-16")
dt2 <- as.POSIXlt(dt)
dt2$yday
[1] 74
And then your choice whether the first week of the year is zero (as in indexing in C) or 1 (as in indexing in R).
No packages to learn, update, worry about bugs in.
Actually, I think you may have discovered a bug in the week(...) function, or at least an error in the documentation. Hopefully someone will jump in and explain why I am wrong.
Looking at the code:
library(lubridate)
> week
function (x)
yday(x)%/%7 + 1
<environment: namespace:lubridate>
The documentation states:
Weeks is the number of complete seven day periods that have occured between the date and January 1st, plus one.
But since Jan 1 is the first day of the year (not the zeroth), the first "week" will be a six day period. The code should (??) be
(yday(x)-1)%/%7 + 1
NB: You are using week(...) in the data.table package, which is the same code as lubridate::week except it coerces everything to integer rather than numeric for efficiency. So this function has the same problem (??).
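The difference between the two formulas is easy to check in base R without calling either package. In this sketch the 1-based day of the year is rebuilt from as.POSIXlt (whose yday field is 0-based, hence the + 1), and both expressions quoted above are applied to a few dates, including the three from the question.

```r
dates <- as.Date(c("2014-01-06", "2014-01-07",
                   "2014-03-16", "2014-03-17", "2014-03-18"))

# 1-based day of the year (as.POSIXlt()$yday counts from 0)
yday1 <- as.POSIXlt(dates)$yday + 1L

week_as_coded <- yday1 %/% 7 + 1        # formula shown in the function body above
week_adjusted <- (yday1 - 1) %/% 7 + 1  # suggested correction

data.frame(dates, yday1, week_as_coded, week_adjusted)
```

With the formula as coded, the first "week" ends on January 6, and the number changes between March 17 and 18, which matches the Tuesday switch seen in the question. Neither variant is the ISO 8601 week; both simply count seven-day blocks from January 1.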
If you want to get the week number with the year, use "%Y-W%V":
e.g. yearAndweeks <- strftime(dates, format = "%Y-W%V")
so
> strftime(c("2014-03-16", "2014-03-17","2014-03-18", "2014-01-01"), format = "%Y-W%V")
becomes:
[1] "2014-W11" "2014-W12" "2014-W12" "2014-W01"
If you want to get the week number with the year, Grant Shannon's solution using strftime works, but you need to make some corrections for the dates around January 1st. For instance, 2016-01-03 (yyyy-mm-dd) is week 53 of year 2015, not 2016. And 2018-12-31 is week 1 of 2019, not of 2018. This code provides some examples and a solution. In column "yearweek" the years are sometimes wrong; in "yearweek2" they are corrected (rows 2 and 5).
library(dplyr)
library(lubridate)
# create a testset
test <- data.frame(matrix(data = c("2015-12-31",
"2016-01-03",
"2016-01-04",
"2018-12-30",
"2018-12-31",
"2019-01-01") , ncol=1, nrow = 6 ))
# add a colname
colnames(test) <- "date_txt"
# this code provides correct year-week numbers
test <- test %>%
mutate(date = as.Date(date_txt, format = "%Y-%m-%d")) %>%
mutate(yearweek = as.integer(strftime(date, format = "%Y%V"))) %>%
mutate(yearweek2 = ifelse(test = day(date) > 7 & substr(yearweek, 5, 6) == '01',
yes = yearweek + 100,
no = ifelse(test = month(date) == 1 & as.integer(substr(yearweek, 5, 6)) > 51,
yes = yearweek - 100,
no = yearweek)))
# print the result
print(test)
date_txt date yearweek yearweek2
1 2015-12-31 2015-12-31 201553 201553
2 2016-01-03 2016-01-03 201653 201553
3 2016-01-04 2016-01-04 201601 201601
4 2018-12-30 2018-12-30 201852 201852
5 2018-12-31 2018-12-31 201801 201901
6 2019-01-01 2019-01-01 201901 201901
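A possibly simpler route, offered as a sketch rather than a replacement for the code above: ?strptime documents %G as the week-based year that goes with %V, so pairing the two should give the corrected year directly. Support for these output specifiers can depend on the platform, so it is worth checking on your own system.

```r
dates <- as.Date(c("2015-12-31", "2016-01-03", "2016-01-04",
                   "2018-12-30", "2018-12-31", "2019-01-01"))

# %G = ISO 8601 week-based year, %V = ISO 8601 week number
iso_yearweek <- format(dates, "%G-W%V")
iso_yearweek
```

On the six test dates this should agree with the "yearweek2" column above (2016-01-03 falls in 2015-W53 and 2018-12-31 in 2019-W01).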
I think the problem is that the week calculation somehow uses the first day of the year. I don't understand the internal mechanics, but you can see what I mean with this example:
library(data.table)
dd <- seq(as.IDate("2013-12-20"), as.IDate("2014-01-20"), 1)
# dd <- seq(as.IDate("2013-12-01"), as.IDate("2014-03-31"), 1)
dt <- data.table(i = 1:length(dd),
day = dd,
weekday = weekdays(dd),
day_rounded = round(dd, "weeks"))
## Now let's add the weekdays for the "rounded" date
dt[ , weekday_rounded := weekdays(day_rounded)]
## This seems to make internal sense with the "week" calculation
dt[ , weeknumber := week(day)]
dt
i day weekday day_rounded weekday_rounded weeknumber
1: 1 2013-12-20 Friday 2013-12-17 Tuesday 51
2: 2 2013-12-21 Saturday 2013-12-17 Tuesday 51
3: 3 2013-12-22 Sunday 2013-12-17 Tuesday 51
4: 4 2013-12-23 Monday 2013-12-24 Tuesday 52
5: 5 2013-12-24 Tuesday 2013-12-24 Tuesday 52
6: 6 2013-12-25 Wednesday 2013-12-24 Tuesday 52
7: 7 2013-12-26 Thursday 2013-12-24 Tuesday 52
8: 8 2013-12-27 Friday 2013-12-24 Tuesday 52
9: 9 2013-12-28 Saturday 2013-12-24 Tuesday 52
10: 10 2013-12-29 Sunday 2013-12-24 Tuesday 52
11: 11 2013-12-30 Monday 2013-12-31 Tuesday 53
12: 12 2013-12-31 Tuesday 2013-12-31 Tuesday 53
13: 13 2014-01-01 Wednesday 2014-01-01 Wednesday 1
14: 14 2014-01-02 Thursday 2014-01-01 Wednesday 1
15: 15 2014-01-03 Friday 2014-01-01 Wednesday 1
16: 16 2014-01-04 Saturday 2014-01-01 Wednesday 1
17: 17 2014-01-05 Sunday 2014-01-01 Wednesday 1
18: 18 2014-01-06 Monday 2014-01-01 Wednesday 1
19: 19 2014-01-07 Tuesday 2014-01-08 Wednesday 2
20: 20 2014-01-08 Wednesday 2014-01-08 Wednesday 2
21: 21 2014-01-09 Thursday 2014-01-08 Wednesday 2
22: 22 2014-01-10 Friday 2014-01-08 Wednesday 2
23: 23 2014-01-11 Saturday 2014-01-08 Wednesday 2
24: 24 2014-01-12 Sunday 2014-01-08 Wednesday 2
25: 25 2014-01-13 Monday 2014-01-08 Wednesday 2
26: 26 2014-01-14 Tuesday 2014-01-15 Wednesday 3
27: 27 2014-01-15 Wednesday 2014-01-15 Wednesday 3
28: 28 2014-01-16 Thursday 2014-01-15 Wednesday 3
29: 29 2014-01-17 Friday 2014-01-15 Wednesday 3
30: 30 2014-01-18 Saturday 2014-01-15 Wednesday 3
31: 31 2014-01-19 Sunday 2014-01-15 Wednesday 3
32: 32 2014-01-20 Monday 2014-01-15 Wednesday 3
i day weekday day_rounded weekday_rounded weeknumber
My workaround is this function:
https://github.com/geneorama/geneorama/blob/master/R/round_weeks.R
round_weeks <- function(x){
require(data.table)
dt <- data.table(i = 1:length(x),
day = x,
weekday = weekdays(x))
offset <- data.table(weekday = c('Sunday', 'Monday', 'Tuesday', 'Wednesday',
'Thursday', 'Friday', 'Saturday'),
offset = -(0:6))
dt <- merge(dt, offset, by="weekday")
dt[ , day_adj := day + offset]
setkey(dt, i)
return(dt[ , day_adj])
}
Of course, you can easily change the offset to make Monday first or whatever. The best way to do this would be to add an offset to the offset... but I haven't done that yet.
I provided a link to my simple geneorama package, but please don't rely on it too much because it's likely to change and is not well documented.
Using only base, I wrote the following function.
Note:
Assumes Mon is day number 1 in the week
First week is week 1
Returns 0 if week is 52 from last year
Fine-tune to suit your needs.
findWeekNo <- function(myDate){
# Find out the start day of week 1; that is the date of first Mon in the year
weekday <- switch(weekdays(as.Date(paste(format(as.Date(myDate),"%Y"),"01-01", sep = "-"))),
"Monday"={1},
"Tuesday"={2},
"Wednesday"={3},
"Thursday"={4},
"Friday"={5},
"Saturday"={6},
"Sunday"={7}
)
firstMon <- ifelse(weekday==1,1, 9 - weekday )
weekNo <- floor((as.POSIXlt(myDate)$yday - (firstMon-1))/7)+1
return(weekNo)
}
findWeekNo("2017-01-15") # 2
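One caveat with the function above is that weekdays() returns day names in the current locale, so the switch on English names can fail elsewhere. A hedged variant is sketched below: find_week_no is a made-up name, the arithmetic is unchanged, and the weekday of January 1 is taken from format(..., "%u"), which gives 1 (Monday) to 7 (Sunday) independent of the language. It handles a single date at a time.

```r
find_week_no <- function(my_date) {
  d    <- as.Date(my_date)
  jan1 <- as.Date(paste0(format(d, "%Y"), "-01-01"))
  weekday <- as.integer(format(jan1, "%u"))      # 1 = Monday ... 7 = Sunday
  # day of January on which the first Monday falls
  first_mon <- if (weekday == 1) 1 else 9 - weekday
  floor((as.POSIXlt(d)$yday - (first_mon - 1)) / 7) + 1
}

find_week_no("2017-01-15")
find_week_no("2017-01-01")  # before the first Monday of 2017
find_week_no("2018-01-01")  # 2018 starts on a Monday
```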

Resources